The AgroTagger is a highly experimental metadata extraction, thematic classification and indexing system. The system is mainly based on two components:
- A customized implementation of Apache Nutch Web Crawler, which can be used to crawl the Web starting from a list of Web sites (i.e. URLs)
- The AgroTagger component, a tool based on MAUI that indexes Web documents (PDF files, HTML pages, text files, etc.) by identifying their main topics and creating RDF triples that link the URL of each document to one or more AGROVOC URIs
The purpose of the application is to support the following workflow:
- The user supplies a list of URLs of pre-selected Web resources
- An Apache Nutch Web Crawler crawls the Web starting from those URLs
- The AgroTagger assigns AGROVOC URIs to each discovered resource
- Triples are produced and stored in a triplestore
The output of the current application is RDF/XML.
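As a rough illustration of what such output could look like, the sketch below serializes a mapping from document URLs to AGROVOC URIs as RDF/XML using only the Python standard library. It is not taken from the AgroTagger code base: the function name `to_rdf_xml`, the choice of `dcterms:subject` as the linking predicate, the example document URL, and the AGROVOC concept identifiers are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
# Assumption: dcterms:subject is used as the predicate linking a document
# to a concept; the actual AgroTagger output may use a different property.
DCT_NS = "http://purl.org/dc/terms/"


def to_rdf_xml(tags):
    """Serialize {document URL: [AGROVOC URI, ...]} as an RDF/XML string."""
    ET.register_namespace("rdf", RDF_NS)
    ET.register_namespace("dcterms", DCT_NS)
    root = ET.Element(f"{{{RDF_NS}}}RDF")
    for doc_url, concept_uris in tags.items():
        # One rdf:Description per document, identified by its URL.
        desc = ET.SubElement(
            root, f"{{{RDF_NS}}}Description", {f"{{{RDF_NS}}}about": doc_url}
        )
        # One triple per assigned concept.
        for uri in concept_uris:
            ET.SubElement(
                desc, f"{{{DCT_NS}}}subject", {f"{{{RDF_NS}}}resource": uri}
            )
    return ET.tostring(root, encoding="unicode")


# Hypothetical input: the document URL and concept identifiers are placeholders.
example = {
    "http://example.org/reports/maize-yield.pdf": [
        "http://aims.fao.org/aos/agrovoc/c_0000001",
        "http://aims.fao.org/aos/agrovoc/c_0000002",
    ]
}

rdf_xml = to_rdf_xml(example)
print(rdf_xml)
```

Each document becomes one `rdf:Description` element, and each assigned concept becomes one triple, which matches the "URL of a document linked to some AGROVOC URIs" description above.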
Usage and deployment
The AgroTagger is available in several forms:
- A downloadable software tool: https://github.com/agrisfao/agrotagger
- A grid workflow executable through the INFN Science Gateway (http://aginfra-sg.ct.infn.it/applications#)
- Direct link, accessible only with grid credentials: https://aginfra-sg.ct.infn.it/run-agrovoc-tagging
- A REST API (see the agINFRA REST APIs wiki page) calling the grid workflow
Example Usage Scenario
Our example user is Helga, a webmaster for a CGIAR center. The center has a mixture of GIS, statistical and research documents, much of which is published on its Web site.
The chief of Helga’s communication and outreach division is not happy with the center’s current Web presence. Although the center produces many excellent outputs, users complain that they cannot find them and that it is difficult to understand how outputs from different domains are connected. Helga tries to begin by indexing some of the research documents, but the process is simply too inefficient.
agINFRA powered version
Helga uses agINFRA’s crawling and indexing capabilities. By pointing the agINFRA AgroTagger at her Web site and defining a crawling depth and a crawling frequency, she regularly receives an index of the entire center’s site. The system returns AGROVOC URIs along with the source URL for every page and resource she manages. She stores this information in a triplestore and uses it to link together the diverse information resources in her system.
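One way to picture the linking step is the sketch below, which takes (document URL, AGROVOC URI) pairs and reports which documents share concepts. It is an in-memory stand-in for a query that would normally run against the triplestore; the function name `related_by_concept`, the example URLs, and the concept identifiers are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def related_by_concept(triples):
    """Map each pair of document URLs to the set of AGROVOC URIs they share."""
    # Index documents by the concepts assigned to them.
    docs_by_concept = defaultdict(set)
    for doc_url, concept_uri in triples:
        docs_by_concept[concept_uri].add(doc_url)

    # Any two documents tagged with the same concept are considered related.
    shared = defaultdict(set)
    for concept_uri, docs in docs_by_concept.items():
        for a, b in combinations(sorted(docs), 2):
            shared[(a, b)].add(concept_uri)
    return dict(shared)


# Hypothetical tagging results: URLs and concept identifiers are placeholders.
triples = [
    ("http://example.org/gis/soil-map", "http://aims.fao.org/aos/agrovoc/c_0000001"),
    ("http://example.org/papers/soil-study.pdf", "http://aims.fao.org/aos/agrovoc/c_0000001"),
    ("http://example.org/stats/rainfall.csv", "http://aims.fao.org/aos/agrovoc/c_0000002"),
]

links = related_by_concept(triples)
for (doc_a, doc_b), concepts in links.items():
    print(doc_a, "<->", doc_b, "via", sorted(concepts))
```

In this example the GIS layer and the research paper are linked because they carry the same concept URI, while the statistical dataset stays unlinked. This is the kind of cross-domain connection the scenario describes.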
See the agINFRA REST APIs page