The AgroTagger is a highly experimental metadata extraction, thematic classification and indexing system. The system is mainly based on two components:

  • A customized implementation of Apache Nutch Web Crawler, which can be used to crawl the Web starting from a list of Web sites (i.e. URLs)
  • The AgroTagger component, a tool based on MAUI that indexes Web documents (such as PDF files, HTML pages and text files) by identifying their main topics and creating RDF triples that link the URL of each document to one or more AGROVOC URIs

The purpose of the application is to support the following workflow:

  • The workflow starts from a list of URLs of pre-selected Web resources
  • The Apache Nutch Web Crawler crawls the Web starting from those URLs
  • The AgroTagger assigns AGROVOC URIs to each discovered resource
  • Triples are produced and stored in a triplestore
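
The workflow above can be illustrated with a minimal Python sketch. It is not the AgroTagger implementation: the in-memory PAGES dictionary stands in for the Web, the keyword lookup in VOCAB stands in for the MAUI-based topic detection, the concept URIs are placeholders rather than real AGROVOC identifiers, and the use of dcterms:subject as the linking predicate is an assumption.

```python
from collections import deque

# Hypothetical in-memory "Web": URL -> (page text, outgoing links).
PAGES = {
    "http://example.org/": ("maize yield report",
                            ["http://example.org/a", "http://example.org/b"]),
    "http://example.org/a": ("rice irrigation study", ["http://example.org/c"]),
    "http://example.org/b": ("contact page", []),
    "http://example.org/c": ("maize and rice trade", []),
}

# Hypothetical keyword -> concept URI lookup (placeholder IDs, not real AGROVOC concepts).
VOCAB = {
    "maize": "http://aims.fao.org/aos/agrovoc/c_maize_placeholder",
    "rice": "http://aims.fao.org/aos/agrovoc/c_rice_placeholder",
}

# Assumed predicate for linking a document to a concept.
SUBJECT = "http://purl.org/dc/terms/subject"


def crawl(seeds, max_depth):
    """Breadth-first crawl from the seed URLs, following links up to max_depth."""
    unique_seeds = list(dict.fromkeys(seeds))
    seen = set(unique_seeds)
    queue = deque((url, 0) for url in unique_seeds)
    found = []
    while queue:
        url, depth = queue.popleft()
        if url not in PAGES:
            continue  # unreachable page in this toy Web
        found.append(url)
        if depth == max_depth:
            continue  # do not follow links beyond the depth limit
        for link in PAGES[url][1]:
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return found


def tag(text):
    """Return the concept URIs whose keyword occurs as a word in the text."""
    words = set(text.lower().split())
    return [uri for keyword, uri in VOCAB.items() if keyword in words]


def run_workflow(seeds, max_depth):
    """Crawl, tag each discovered page, and return (subject, predicate, object) triples."""
    triples = []
    for url in crawl(seeds, max_depth):
        for uri in tag(PAGES[url][0]):
            triples.append((url, SUBJECT, uri))
    return triples


if __name__ == "__main__":
    for triple in run_workflow(["http://example.org/"], max_depth=2):
        print(triple)
```

In a real deployment the returned triples would be written to a triplestore rather than printed.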

The output of the current application is RDF/XML.
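
As a rough illustration of what such output could look like, the sketch below serializes one document URL and its assigned concept URIs as RDF/XML using only the Python standard library. The function name to_rdf_xml, the choice of dcterms:subject as the predicate and the placeholder concept URI are assumptions made for this example; the actual vocabulary used by the AgroTagger is not specified here.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DCT_NS = "http://purl.org/dc/terms/"  # assumed vocabulary for the linking predicate


def to_rdf_xml(doc_url, concept_uris):
    """Serialize 'doc_url dcterms:subject concept' statements as an RDF/XML string."""
    ET.register_namespace("rdf", RDF_NS)
    ET.register_namespace("dcterms", DCT_NS)
    root = ET.Element(f"{{{RDF_NS}}}RDF")
    # One rdf:Description per document, identified by its URL.
    description = ET.SubElement(
        root, f"{{{RDF_NS}}}Description", {f"{{{RDF_NS}}}about": doc_url}
    )
    # One property element per assigned concept.
    for uri in concept_uris:
        ET.SubElement(
            description, f"{{{DCT_NS}}}subject", {f"{{{RDF_NS}}}resource": uri}
        )
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(to_rdf_xml(
        "http://example.org/report.pdf",
        ["http://aims.fao.org/aos/agrovoc/c_placeholder"],  # placeholder concept ID
    ))
```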

Responsible body(ies):

  • FAO for the software and for AGROVOC
  • INFN for the grid application
  • IPB for the REST API

Usage and deployment (if publicly accessible)

The AgroTagger is available in different forms:

Example Usage Scenario

Our Users

Our example user is Helga, a webmaster for a CGIAR center. The center has a mixture of GIS, statistical and research documents, and much of this material is on its Web site.

Before agINFRA

The chief of Helga’s communication and outreach division is not happy with the center’s current Web presence. Although the center produces many excellent outputs, users complain that they cannot find them and that it is difficult to understand how outputs from different domains are connected. Helga begins by trying to index some of the research documents, but the process is simply too inefficient.

agINFRA powered version

Helga uses agINFRA’s powerful crawling and indexing capabilities. By pointing the agINFRA AgroTagger at her Web site and defining a crawling depth and a crawling frequency, she regularly receives a complete and accurate index of the center’s entire site. The system returns AGROVOC URIs along with the source URL for every page and resource she manages. She stores this information in a triplestore and uses it to link together the diverse information resources in her system.
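
One simple way Helga could use the stored triples to connect resources is to group source URLs by the AGROVOC concept they share. The sketch below works on plain Python tuples instead of a real triplestore; the function name related_resources and the sample URLs and concept URIs are hypothetical.

```python
from collections import defaultdict


def related_resources(triples):
    """Map each concept URI to the sorted URLs that share it (two or more resources)."""
    by_concept = defaultdict(set)
    for url, _predicate, concept in triples:
        by_concept[concept].add(url)
    # Keep only concepts that actually connect different resources.
    return {
        concept: sorted(urls)
        for concept, urls in by_concept.items()
        if len(urls) > 1
    }


if __name__ == "__main__":
    subject = "http://purl.org/dc/terms/subject"  # assumed predicate
    sample = [
        ("http://example.org/map.html", subject, "http://example.org/concept/soil"),
        ("http://example.org/stats.csv", subject, "http://example.org/concept/soil"),
        ("http://example.org/paper.pdf", subject, "http://example.org/concept/trade"),
    ]
    for concept, urls in related_resources(sample).items():
        print(concept, urls)
```

With a real triplestore the same grouping would typically be expressed as a query over the stored triples.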


