Automated Metadata Extraction Can Make Data Swamps More Navigable

Skluzacek, Tyler J.

doi:10.6082/uchicago.4760

Published August 2022 | Version v1

Dissertation Open

Skluzacek, Tyler J.¹

1. University of Chicago

In a science utopia, every research repository would be accompanied by a database of rich, searchable metadata that users can quickly and confidently query to discover, retrieve, and organize the many artifacts of research workflows. In practice, science is far from this utopia; repositories commonly decay into disorganized data swamps that overwhelm scientists and leave crucial research data inaccessible to those who could use them. To dredge data swamps, I describe Xtract, an automated metadata extraction system for science that crawls large repositories, dynamically constructs extraction workflows by intelligently mapping extractors to diverse file types, scalably executes these workflows on distributed research cyberinfrastructure, and publishes the derived metadata into a search index. I show via a user study that an Xtract-generated search index drastically increases the speed and confidence with which researchers navigate their science collections. Finally, I highlight the benefits of this approach by applying Xtract to real-world repositories that collectively span over 6 million files and 1 PB of data across materials science, climate science, battery modeling, and spectroscopy.
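The crawl, map, extract, and publish stages named in the abstract can be illustrated with a minimal local sketch. This is not Xtract's implementation or API: the extractor functions, the `EXTRACTORS` table, and `build_index` are hypothetical names chosen for illustration, the mapping uses only the file extension, everything runs in one process, and the "search index" is a plain in-memory dictionary.

```python
import os
from typing import Callable, Dict, List


# Hypothetical extractors: each takes a file path and returns a metadata dict.
def extract_text(path: str) -> dict:
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        content = f.read()
    return {"kind": "text", "words": len(content.split())}


def extract_tabular(path: str) -> dict:
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        header = f.readline().strip()
    return {"kind": "tabular", "columns": header.split(",") if header else []}


def extract_generic(path: str) -> dict:
    # Fallback when no specialised extractor matches the file type.
    return {"kind": "unknown"}


# Assumed mapping from file extension to extractor (illustrative only).
EXTRACTORS: Dict[str, Callable[[str], dict]] = {
    ".txt": extract_text,
    ".csv": extract_tabular,
}


def crawl(root: str) -> List[str]:
    """Walk the directory tree and return every file path, sorted."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            paths.append(os.path.join(dirpath, name))
    return sorted(paths)


def build_index(root: str) -> Dict[str, dict]:
    """Crawl, pick an extractor per file, and collect metadata records."""
    index: Dict[str, dict] = {}
    for path in crawl(root):
        ext = os.path.splitext(path)[1].lower()
        extractor = EXTRACTORS.get(ext, extract_generic)
        record = extractor(path)
        record["size_bytes"] = os.path.getsize(path)
        # Key each record by its path relative to the crawled root.
        index[os.path.relpath(path, root)] = record
    return index
```

Calling `build_index("/path/to/repository")` returns one metadata record per file, keyed by relative path. A real system of the kind described would replace the dictionary with a search service and dispatch the extractors to remote compute resources.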

Files

Name	Size
Skluzacek_uchicago_0330D_16427.pdf md5:be3219281d7a8ed17496851c211b8d00	15.6 MB

Additional details

Other: oai:uchicago.tind.io:4760

Division(s): Physical Sciences Division
Department(s): Computer Science

