AgStor

From agINFRA

Objective

This component mines and extracts bibliographic citations from external systems. The citations are then filtered and annotated with the agricultural classifications used within the AGRIS linked data framework.

Background

This component was originally envisaged as an agriculture-specific version of BioStor, a bibliographic citation repository application developed for biodiversity literature with built-in visualisation features. As agINFRA progressed it became apparent that this idea would simply duplicate other initiatives within the project. agINFRA has enhanced two other repository applications, AgriDrupal and AgriOcean DSpace, delivering customizations of these two important open source document management systems (based respectively on Drupal and DSpace). Hence, during year two we investigated our options for delivering the envisaged harvesting functionality in a way better aligned with other agINFRA tools, thereby avoiding further duplication of development effort during the project and of maintenance effort thereafter.

This culminated in a formal change to the Description of Work (DoW). The revised output from this component is the harvested and filtered bibliographic citations used to further populate FAO’s AGRIS. Once the references are in AGRIS, they are available to populate the other repositories, and application developers can craft a visualization component dedicated to the needs of their users, along with other bespoke tools as required.

Implementation

The final implementation is a series of scripts that harvest existing non-agricultural repositories to produce files that are then ingested into AGRIS as part of the monthly bulk upload cycle. This exposes the harvested references both through the existing AGRIS interface and through the new OpenAGRIS interface in which the references are automatically converted to RDF for incorporation into linked open data based applications and workflows.

Choice of external data sources

The primary purpose of agStor is to extend the range of relevant literature available for agricultural researchers to search beyond that already actively deposited in AGRIS by agricultural institutions. Our initial approach was to harvest two data sources: BHL and Mendeley.

BHL - Biodiversity Heritage Library

The BHL (Biodiversity Heritage Library) is a large digital archive of legacy biodiversity literature, comprising (in October 2014) over 44 million pages scanned from books, monographs, and journals. The BHL project began in 2005 when ten natural history museum libraries, botanical libraries, and research institutions in the UK and the USA agreed to collaborate in digitizing their legacy literature [1], with texts dating back as far as the 16th century. It now draws on libraries “that cooperate to digitize and make accessible the legacy literature of biodiversity” from all of the inhabited continents [2].

Complementing the public domain literature in their collections, the BHL partners have obtained permission from publishers to digitize and publish significant copyrighted content. In conjunction with the partners’ geographical scope, this makes the BHL a valuable resource of accessible biodiversity literature. This long-term view can prove invaluable in locating wild relatives of crops and understanding their relationship to local habitats and ecosystems.

Maximising the value of BHL to agricultural researchers, however, requires narrowing its breadth of coverage so that only agriculturally relevant literature is brought into AGRIS. It is this harvesting and filtering that agStor delivers.

We should also note the Bibliography of Life (BoL) here, because it was referred to in our earlier reports on this component. BoL was developed in the EU FP7 ViBRANT project to provide a comprehensive bibliographic reference collection covering all of life. Its inclusion seemed relevant to our work in agINFRA. Indeed, the BHL harvesters used in agINFRA were developed from those originally developed for ViBRANT, and that was the issue: all agriculturally relevant literature in BoL was obtained from BHL. Therefore, after some analysis, we determined there was no purpose in completing our work to harvest BoL to populate AGRIS, because no new records would be added. They would already be present in AGRIS, having been harvested from BHL.

Mendeley

When first preparing our bid submission for the project that is now agINFRA, Mendeley was an emerging bibliographic reference tool. It was at the forefront of those tools adding a social media dimension to bibliographic management. As such, we wanted to introduce this exciting development to agricultural researchers and incorporate the breadth of user contributed - and therefore curated - references into agINFRA’s bibliographic toolset. However, changes at Mendeley (primarily driven by its acquisition by Elsevier we believe) prevented the successful conclusion of this work within agINFRA.

Our problems began when all our existing scripts to harvest Mendeley were rendered invalid by a change in the authorisation mechanism, necessitating a rewrite. At this time, Mendeley seemed to have personnel issues, as support was not forthcoming: changes were made without notice to the developer community, and documentation for the new ways of working was not produced. There is an extensive archive in the Mendeley developers group of the issues confronting those, like us, who were attempting to integrate Mendeley into their workflows. More changes followed, including a new API necessitating yet another rewrite, and ultimately a licence change that made our harvesting questionable. In August 2014, having rewritten our scripts several times, and in the absence of full support and documentation, we decided that even if we rewrote our scripts yet again, there was a serious question over their sustainability after agINFRA completed. Therefore, reluctantly, we decided to replace Mendeley as an external data source with CORE.

CORE - COnnecting REpositories

To replace Mendeley we selected CORE (COnnecting REpositories) as our external data source. This tool is itself an aggregator, and includes BHL among the sources it harvests. As such it is a good replacement because it draws on a wide range of sources, and includes many of the formal, institutional repositories from which users would select material to include in Mendeley.

CORE is also valuable to agINFRA because using it provides an example of a different harvesting technique. CORE does not rely on exchanging files conforming to an XML schema that is more than a decade old, as MODS (which we use with BHL) is, but exploits a more recent development in data sharing technology, OAI (Open Archives Initiative). In addition, CORE exposes its records through a SPARQL endpoint as well as a more conventional API, offering alternative solutions for a sustainable future.
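
As a rough illustration of OAI-style harvesting, the sketch below assumes the technique meant here is the OAI-PMH protocol. The `ListRecords` verb, `metadataPrefix` and `resumptionToken` are standard OAI-PMH terms, but the base URL is a placeholder, the sample response is made up, and the function names are hypothetical. This is not the agStor harvesting script.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}
BASE_URL = "http://example.org/oai"  # placeholder, not CORE's real endpoint


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords request; a resumption token replaces other arguments."""
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def parse_page(xml_text):
    """Return the record identifiers on one page and the token for the next page."""
    root = ET.fromstring(xml_text)
    ids = [e.text for e in root.findall(".//oai:header/oai:identifier", OAI_NS)]
    token = root.findtext(".//oai:resumptionToken", default="", namespaces=OAI_NS)
    return ids, (token.strip() or None)


# Made-up response page, used here instead of a live network call.
SAMPLE_PAGE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:example.org:1</identifier></header></record>
    <record><header><identifier>oai:example.org:2</identifier></header></record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

first_url = list_records_url(BASE_URL)
ids, token = parse_page(SAMPLE_PAGE)
next_url = list_records_url(BASE_URL, resumption_token=token)
print(first_url, ids, next_url)
```

A real harvester would fetch each URL, parse the page, and repeat until no resumption token is returned.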

We had a challenging timetable in starting this work only in August 2014. That we knew the CORE team and could rely on their support, should we need assistance taking our existing CORE harvesting scripts and applying them within agINFRA, was another factor in selecting CORE as Mendeley’s replacement.

Usage and deployment

The scripts are internal to the project: they are used to populate AGRIS, and it is AGRIS, not the scripts themselves, that is the public face of this component. The scripts are shared between their developers, the Open University, and their prime users, FAO.

The scripts for harvesting BHL are running. These exploit BHL’s monthly data download to use a local copy of BHL’s index tables. These are searched and filtered for relevant literature. For the identified records the full MODS record as held by BHL is downloaded. (MODS is a common standard to both systems: BHL can export records in this format, and AGRIS can import records in this format.) The downloaded BHL bibliographic records are then enhanced by finding and inserting into them the location of both the PDF and OCRed text copies of the referenced item as held in the Internet Archive. For this, the MODS location and url elements are used, exploiting the latter’s note attribute. See an example of an enhanced MODS record below. This enhanced MODS record is then passed to FAO to be included in its monthly bulk upload of new files into AGRIS. At the time of writing, the relevant references identified in October 2014’s data download from BHL are waiting for the next bulk upload to AGRIS due later this month.
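
The enhancement step can be sketched as follows. This is a minimal illustration, not the project's script. It assumes the Internet Archive identifier of the item is already known (how the scripts find it is not described here) and that the download URLs follow the pattern seen in the sample record on this page. The function name `add_archive_links` is hypothetical.

```python
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"
ET.register_namespace("", MODS_NS)  # serialise MODS as the default namespace


def add_archive_links(mods_xml, archive_id):
    """Append a MODS <location> holding PDF and OCR text URLs.

    Assumption: URLs follow the pattern in this page's sample record.
    """
    root = ET.fromstring(mods_xml)
    location = ET.SubElement(root, f"{{{MODS_NS}}}location")
    base = f"http://www.archive.org/download/{archive_id}/{archive_id}"
    for note, suffix in (("pdf", ".pdf"), ("txt", "_djvu.txt")):
        url = ET.SubElement(location, f"{{{MODS_NS}}}url")
        url.set("note", note)  # the note attribute distinguishes the two copies
        url.text = base + suffix
    return ET.tostring(root, encoding="unicode")


# Heavily trimmed record, for illustration only.
record = """<mods xmlns="http://www.loc.gov/mods/v3" version="3.0">
<titleInfo><title>The farmer's guide</title></titleInfo>
</mods>"""

enhanced = add_archive_links(record, "farmersguideorne00gaskrich")
print(enhanced)
```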

The scripts for harvesting CORE continue to be refined to improve precision without affecting recall. Output from these scripts will be available for the next bulk upload too, and monthly bulk uploads thereafter.

Example Usage Scenarios

Users who benefit from AGRIS:

  • Software developers can re-use the proven and tested bibliographic reference extraction and filtering features available in this component.
  • People working in the agricultural domain can benefit from the wider range of source materials brought into AGRIS. In a similar vein, users of the AGRIS search engine and of Organic.Edunet can benefit from a set of consolidated literature references without recourse to searching each resource separately.

The agINFRA powered AGRIS offers such users single point, quicker and simpler searches for bibliographic references, drawing on records across many repositories, with the ability to refine within search.

  • Librarians looking online for a bibliography on a particular agricultural topic can benefit from the breadth of additional relevant agricultural bibliographic references, because the agINFRA powered AGRIS now includes general and specialist bibliographic references from sources that do not currently contribute to AGRIS.

Before agINFRA

In the related EU FP7 ViBRANT project, a tool was developed to harvest all BHL digital library references. This tool is integrated into the Bibliography of Life workflow, of which RefBank is a key component. The harvester does not discriminate among categories of references. Manual curation of references through additional RefBank functionality provides this facility. As such, it is not suitable for independent use by agronomists. Nor is there any integration with AGRIS.

Researchers and librarians would access their, or their institution’s, preferred repository, which could be based on a variety of platforms such as RefBank, BibServer and Mendeley, and manage their own bibliographic record collection sharing the references with their peers on an ad hoc basis.

Software developers would similarly craft tools that would have to access this variety of possible sources.

After agINFRA

The agINFRA powered version addresses the lack of discrimination by incorporating filters based on the proven AGROVOC vocabulary (aims.fao.org/agrovoc) to select agriculturally relevant references. The filtered references are automatically propagated to AGRIS, and thereby OpenAGRIS too, for ease of access by the agricultural research community.
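
The idea of a vocabulary-based filter can be sketched as below. This is a simplified illustration under stated assumptions: the term set is a small illustrative stand-in for AGROVOC labels, the record layout (a dict with `title` and `subjects`) is hypothetical, and whole-phrase matching is only one possible strategy. The actual agStor filters are not described in detail here.

```python
import re

# Illustrative stand-in for AGROVOC labels; a real filter would load the vocabulary.
AGROVOC_TERMS = {"agriculture", "crops", "soil fertility", "irrigation"}


def is_agricultural(record, terms=AGROVOC_TERMS):
    """Return True if any vocabulary term appears as a whole phrase
    in the record's title or subject headings (case-insensitive)."""
    text = " ".join([record.get("title", "")] + record.get("subjects", [])).lower()
    return any(re.search(rf"\b{re.escape(term)}\b", text) for term in terms)


# Hypothetical records: the first mirrors the sample on this page, the second is made up.
records = [
    {"title": "The farmer's guide : or, A new theory of agriculture",
     "subjects": ["Agriculture"]},
    {"title": "An illustrative title about fossil corals",
     "subjects": ["Corals, Fossil"]},
]

relevant = [r for r in records if is_agricultural(r)]
print([r["title"] for r in relevant])
```

Matching on whole phrases rather than substrings is a design choice that favours precision. For example, it avoids matching "crops" inside an unrelated longer word.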

For end users of bibliographic resources working in the agricultural domain this means that they benefit from the enhanced curation as well as the wider coverage of material by the agINFRA powered solution.

Similarly, a librarian looking online to build a bibliography on a particular agricultural topic can make use of AGRIS’s additional relevant agricultural bibliographic references, which are drawn from general and specialist sources that do not currently contribute to AGRIS. This will improve the breadth of coverage of the librarian’s searches.

Software developers need only access one end point to embrace a wide range of material because it is now available within the AGRIS linked data framework.

APIs

There are no APIs specific to this component because its outputs are fully contained within the AGRIS toolset.

The harvested references can be accessed through the traditional AGRIS web interface or the new beta OpenAGRIS web interface, or programmatically through the listed SPARQL endpoints.

Please see the AGRIS website for more detailed information on how to use AGRIS and OpenAGRIS, including the set of properties for OpenAGRIS.
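
For programmatic access, a request to a SPARQL endpoint might be prepared as in the sketch below. Everything specific in it is an assumption: the endpoint URL is a placeholder, and `dct:title` is a guess at a suitable property. The actual endpoints and the OpenAGRIS property set are documented on the AGRIS website. The function names are hypothetical.

```python
from urllib.parse import urlencode

ENDPOINT = "http://example.org/sparql"  # placeholder, not a real AGRIS endpoint


def build_query(keyword, limit=10):
    """Build a SPARQL query for records whose title contains a keyword.

    Assumption: titles are exposed via dct:title.
    """
    safe = keyword.replace("\\", "\\\\").replace('"', '\\"')  # escape for a SPARQL string
    return (
        "PREFIX dct: <http://purl.org/dc/terms/>\n"
        "SELECT ?record ?title WHERE {\n"
        "  ?record dct:title ?title .\n"
        f'  FILTER(CONTAINS(LCASE(STR(?title)), LCASE("{safe}")))\n'
        "}\n"
        f"LIMIT {int(limit)}"
    )


def build_request_url(endpoint, query):
    """Encode the query as a GET request using the 'query' parameter."""
    return endpoint + "?" + urlencode({"query": query})


query = build_query("farmer's guide", limit=5)
url = build_request_url(ENDPOINT, query)
print(url)
```

The sketch stops at building the URL. Sending the request and parsing the results depends on the endpoint chosen and the result format it returns.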

Sample MODS record

Seen below is a sample BHL bibliographic record enhanced by agStor scripts with links to PDF and text versions of the referenced book.

Take note of the classification tag with an authority attribute of lcc. The tag’s text, S499, is the Library of Congress Classification for Agriculture (General). Using this information, FAO are able to properly classify this record in AGRIS because of the AGRIS-supported development of a mapping between these Library of Congress supplied values and AGROVOC.

For more information see: Caracciolo, C., Stellato, A., Morshed, A., Johannsen, G., Rajbhandari, S., Jacques, Y., Keizer, J.: The AGROVOC Linked Dataset. Semantic Web. 4(3), 341–348 (2013) http://eprints.rclis.org/20648/, accessed 13 October 2014.

<mods xmlns:xlink="http://www.w3.org/1999/xlink" version="3.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.loc.gov/mods/v3"
xsi:schemaLocation="http://www.loc.gov/mods/v3 http://www.loc.gov/standards/mods/v3/mods-3-0.xsd">
<titleInfo>
	<title>The farmer's guide : or, A new theory of agriculture, founded on philosophical and practical principles, and adapted to all climates / by James Gaskins.</title>
</titleInfo>
<name type="personal">
	<namePart>Gaskins, James.</namePart>
</name>
<typeOfResource>text</typeOfResource>
<genre authority="marcgt">book</genre>
<originInfo>
	<place>
		<placeTerm type="text">Baltimore :</placeTerm>
	</place>
	<publisher>S. Sands,</publisher>
	<dateIssued>1838.</dateIssued>
	<dateIssued encoding="marc" point="start">1838</dateIssued>
</originInfo>
<language>
	<languageTerm authority="iso639-2b" type="text">English</languageTerm>
</language>
<subject>
	<topic>1838</topic>
</subject>
<subject>
	<topic>Agriculture</topic>
</subject>
<subject>
	<topic>CHR</topic>
</subject>
<subject>
	<genre>Handbooks, manuals, etc</genre>
</subject>
<classification authority="lcc">S499 .G3</classification>
<identifier type="uri">http://www.biodiversitylibrary.org/bibliography/21568</identifier>
<identifier type="lccn">42004301</identifier>
<recordInfo>
	<recordContentSource authority="marcorg">DLC</recordContentSource>
</recordInfo>
<location>
	<url note="pdf">http://www.archive.org/download/farmersguideorne00gaskrich/farmersguideorne00gaskrich.pdf</url>
	<url note="txt">http://www.archive.org/download/farmersguideorne00gaskrich/farmersguideorne00gaskrich_djvu.txt</url>
</location>
</mods>
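
A consumer of such a record could read the classification and the added links along the lines of the sketch below. It uses a trimmed copy of the record above. The function name `summarise` is hypothetical, and treating the first token of the classification text as the class number is an assumption based on this one example.

```python
import xml.etree.ElementTree as ET

NS = {"m": "http://www.loc.gov/mods/v3"}

# Trimmed copy of the sample record, keeping only the elements read here.
SAMPLE = """<mods xmlns="http://www.loc.gov/mods/v3" version="3.0">
<classification authority="lcc">S499 .G3</classification>
<location>
	<url note="pdf">http://www.archive.org/download/farmersguideorne00gaskrich/farmersguideorne00gaskrich.pdf</url>
	<url note="txt">http://www.archive.org/download/farmersguideorne00gaskrich/farmersguideorne00gaskrich_djvu.txt</url>
</location>
</mods>"""


def summarise(mods_xml):
    """Return (LCC class number, {note: url}) for a MODS record."""
    root = ET.fromstring(mods_xml)
    lcc = root.findtext("m:classification[@authority='lcc']", default="", namespaces=NS)
    # Assumption: the class number is the first whitespace-separated token.
    class_number = lcc.split()[0] if lcc.strip() else None
    links = {u.get("note"): u.text for u in root.findall("m:location/m:url", NS)}
    return class_number, links


class_number, links = summarise(SAMPLE)
print(class_number, sorted(links))
```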

References

[1] Gwinn, N.E., Rinaldo, C.: The Biodiversity Heritage Library: sharing biodiversity literature with the world. IFLA Journal. 35(1), 25–34 (2009)

[2] BHL–Africa, http://blog.biodiversitylibrary.org/2013/04/making-bhl-africa-reality-bhl-africa.html, accessed 13 October 2014
