Finalized: Wednesday, May 27, 2015
Author(s): Shanan E. Peters, Miron Livny, Ian Ross, Chris Ré
Large amounts of geoscience data currently reside in the text, tables and figures of scientific publications.
- Many important science questions require synthesizing data from widely scattered and heterogeneous published sources.
- Efforts to date to synthesize literature-derived data involve slow and costly manual extraction, followed by entry into a predefined database structure.
Problem: data in the published literature are difficult to find and retrieve.
- Legacy data in all publications need to be made openly accessible to automated text and data mining.
- There is currently no digital library resource that contains all of the necessary published data, even though many published sources are openly accessible online.
- Static databases produced from literature cannot be readily updated, assessed, or augmented with new data.
Demonstration scenario: automated spacetime indexing of published literature in Elsevier holdings via Macrostrat and our new text and data mining (TDM) ready digital library resource.
- Negotiated a contractual agreement with Elsevier to develop a TDM-ready library that can be used by EarthCube (and other) communities. Negotiations with other publishers are ongoing.
- Developed secure, high-throughput computing infrastructure to support automated document retrieval and pre-processing with common TDM tools (e.g., natural language parsing, document layout and font recognition).
- Working API leveraged by DigitalCrust project
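One way to picture the spacetime indexing in the demonstration scenario is as a lookup: rock unit names found in a document's text are matched against a table that records each unit's age range and location. The sketch below is illustrative only and is not the GeoDeepDive or Macrostrat implementation; `UnitRecord`, `UNITS`, `index_document`, and every unit name and value in it are hypothetical placeholders.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class UnitRecord:
    """A rock unit with an age range (millions of years) and a location."""
    name: str
    top_age_ma: float     # youngest age of the unit
    bottom_age_ma: float  # oldest age of the unit
    lat: float
    lng: float


# Hypothetical lookup table; names and values are made up for illustration.
UNITS = [
    UnitRecord("Alpha Formation", 100.0, 120.0, 43.0, -89.0),
    UnitRecord("Beta Sandstone", 250.0, 260.0, 40.0, -105.0),
]


def index_document(doc_id, text, units=UNITS):
    """Return one index entry per known unit mentioned in the text."""
    entries = []
    for unit in units:
        # Case-insensitive whole-phrase match on the unit name.
        pattern = re.compile(r"\b" + re.escape(unit.name) + r"\b", re.IGNORECASE)
        mentions = len(pattern.findall(text))
        if mentions:
            entries.append({
                "doc_id": doc_id,
                "unit": unit.name,
                "mentions": mentions,
                "age_ma": (unit.top_age_ma, unit.bottom_age_ma),
                "location": (unit.lat, unit.lng),
            })
    return entries
```

Calling `index_document("doc-1", text)` on a document's text returns a list of entries that tie the document to a time interval and a place. A real system would also need to handle abbreviations, ambiguous names, and layout noise, which this sketch ignores.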
Poster presented at the 2015 All Hands Meeting.
Peters, S., Livny, M., Ross, I., Ré, C. (2015), GeoDeepDive: towards a TDM-ready digital library and machine reading system for the geosciences. Presented at EarthCube All Hands Meeting, Washington, DC, 27-29 May 2015. http://earthcube.org/document/2015/GeoDeepDive-towards-TDM-digital-library-machine-reading