OCR Clean-up

The features of this component are available in agINFRA through the agTextMining and agTagger services implemented by WP3. See D3.2.3 for details.

Introduction

Objective

This component was proposed to post-process documents that have already been scanned, in order to improve OCR accuracy and to extract metadata that the original contributor or annotator did not provide, such as the structure of the document or annotations identifying the research method, variables and results described in a scientific study. It would allow application developers to add a post-processing service for scientific literature that extracts additional semantic annotations for targeted thematic communities.

The original intention, in line with agINFRA’s vision to exploit existing technology, was to use ComTax to deliver post-processing refinement of scanned documents. However, this highly interactive tool was found not to be suitable for incorporation into the agINFRA workflow as that workflow developed and evolved during the project. Therefore, an alternative approach to delivering post-processing was investigated and piloted, as described below.

Pilot

Our pilot project reworked the GoldenGATE interactive mark-up tool into discrete web services. These web services are available as standalone RESTful API calls and also through the Oxford Batch Operation Engine (OBOE) service developed under the EU FP7 ViBRANT project.

The OBOE service provides for ViBRANT some of the functionality that the Grid and Cloud provide to the agINFRA project: easy access to computing services, such as software applications that can be invoked locally but actually reside elsewhere in the Cloud. OBOE is also similar in concept to the agINFRA Science Gateway in that it provides access to a collection of computationally intensive services of broad interest. Although ViBRANT has finished, the range of services continues to expand as new services are added.

Pilot results

Test and evaluation services for this component were deployed. These showed that, without reworking the algorithms within each service, there is little to be gained from deploying the component services on the Grid: the algorithms contain no internal parallelism, so the Grid’s potential to increase processing throughput cannot be exploited within a single document.

Discussion with IPB, who host agINFRA services, concluded that the sensible way for the GoldenGATE web services to exploit the Grid is to parallelize at the document level. In practice, documents would be submitted to the GoldenGATE web services in batches, so that many documents are processed simultaneously by multiple processing nodes on the Grid. This type of offering could benefit librarians and managers of document repositories, but it offers no benefit to individual agricultural researchers, who typically work on one document at a time.
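
The document-level parallelism described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the GoldenGATE or agINFRA implementation: process_document is a hypothetical stand-in for one sequential per-document service call, and a local thread pool stands in for the Grid's processing nodes.

```python
from concurrent.futures import ThreadPoolExecutor


def process_document(text: str) -> dict:
    """Hypothetical stand-in for one sequential per-document service call."""
    return {"chars": len(text), "words": len(text.split())}


def process_batch(documents: list[str], workers: int = 4) -> list[dict]:
    """Process a batch of documents concurrently.

    Each document is still handled sequentially; the speed-up comes only
    from handling several documents at the same time.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map returns results in the same order as the inputs.
        return list(pool.map(process_document, documents))


batch = ["Zea mays yield trial", "Soil survey 1923 report"]
results = process_batch(batch)
```

With a batch of one document the pool has nothing to run in parallel, which is why this arrangement helps repository managers rather than a researcher working on a single text.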

Pilot conclusions

Addressing these issues would be interesting to the computing and data researchers engaged in agINFRA, but the overall aim of the project is to improve the working practices of agricultural researchers, so this component was re-envisaged. The motivation for improving OCR accuracy was to enable better information extraction from the documents. Rather than continuing to try to improve OCR accuracy per se, attention moved to the information extraction services themselves, as delivered in WP3, specifically the agTextMining and agTagger services.

Usage and deployment

Once literature has been OCR-ed, there is little that can be done to improve the accuracy of the resulting text without manual intervention at some point. Our original concept for this component therefore had to be modified during the project. Following evaluation and piloting, neither ComTax (unsuitable) nor GoldenGATE (no benefit) is deployed within agINFRA. In recognition of these constraints, this component has not been realised as originally envisaged, by deploying semi-automated tools, but instead exploits the agTextMining and agTagger services. These services are described in D3.2.3.

Example Usage Scenario

Potential users of the software include:

  • People working on exploiting OCRed legacy text (software developers, information management specialists). Legacy texts such as those available from the Biodiversity Heritage Library can contain errors for a number of reasons, such as the condition of the original document, the scanning process, and the printing typeface. These all affect the accuracy of the final OCRed text, and any errors in the text reduce the ability to search it accurately, e.g. for species names. An application developer can address common OCR-induced errors by using the OCR post-processing service.
  • People working to identify structural elements, metadata, and bibliographic references in a document to aid later information retrieval. Bibliographic aggregators may be able to benefit from additional, relevant keywords identified in the document content to improve the search facilities they offer their users.
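
To illustrate why OCR errors hinder searches for species names, and one generic way a developer might compensate, the following sketch matches a misread name against a list of known names using Python's standard difflib module. It is not the algorithm used by GoldenGATE or by the agINFRA services; the KNOWN_NAMES lexicon, the correct_name helper and the 0.8 cutoff are assumptions made for the example.

```python
import difflib

# Hypothetical lexicon of known-good names; a real application would
# draw on a much larger authority list.
KNOWN_NAMES = ["Triticum aestivum", "Zea mays", "Oryza sativa"]


def correct_name(candidate: str, lexicon: list[str], cutoff: float = 0.8) -> str:
    """Return the closest lexicon entry, or the candidate unchanged if none is close."""
    matches = difflib.get_close_matches(candidate, lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else candidate


# "m" misread as "rn" is a common OCR confusion.
fixed = correct_name("Triticurn aestivum", KNOWN_NAMES)
untouched = correct_name("Glycine max", KNOWN_NAMES)
```

Here fixed becomes "Triticum aestivum", while a name with no close match in the lexicon is returned unchanged. An exact-match search for "Triticum aestivum" would have missed the misread form entirely.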

Before agINFRA

Using GoldenGATE, a monolithic desktop semi-automated annotation application whose development was originally supported by Plazi, an agronomist would manually curate an individual OCRed document.

agINFRA powered version

The individual GoldenGATE modules have been released as tailored web applications addressing specific functions, such as tagging semantic elements within texts. This enhanced version exposes the web applications to a wider audience of researchers and developers, enabling them to re-use existing tools within their applications rather than develop their own. The work was carried out in collaboration with the ViBRANT project, and the resulting web applications are available through ViBRANT. A developer can now tailor a workflow to use the specific modules required by an agricultural scientist working on their document collection.

To complement the continuing GoldenGATE web services, agINFRA offers agTextMining and agTagger, which can extract title, author, references and keywords from texts. These are available over the Grid, and developers can integrate them into workflows alongside the existing GoldenGATE services.

APIs

For use of agINFRA tools available in the Grid, see D3.2.3.
