Published December 2021 | Version v1
Thesis Open

Linking Scientific Research: Extracting Sentences Containing Dataset Information

Creators

  • 1. University of Chicago

Contributors

Committee member:

Description

Given the popularity of data-driven research in scientific fields, we are interested in the combined value of the datasets used in a given area. Motivated by previous studies of research originality and of how combinatorial work improves science, our research seeks to establish strategies for retrieving sentences containing dataset information from academic publications, using COVID-19 epidemiological papers as a case study. We used latent Dirichlet allocation (LDA) and word embedding algorithms to separate epidemiological papers from clinical ones. We also annotated each sentence in the titles and abstracts according to whether it mentions dataset information. Pre-trained word representations enabled classification models to discriminate between data and non-data sentences. An unexpected finding is that, while more diverse terms in a publication's abstract and title help it attract citations, they make it less likely to be one of the top-cited papers. In conclusion, although we have not yet achieved accurate identification of data sentences in papers, we have uncovered techniques for filtering candidate data sentences. As a next step, we suggest inspecting a larger corpus to evaluate the impact of alternative datasets and to gather more information on the papers' word representations and citations.
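The idea of separating data from non-data sentences with word representations can be illustrated with a minimal sketch. This is not the thesis's implementation: the three-dimensional vectors, the four training sentences, and the nearest-centroid rule are all hypothetical stand-ins chosen to keep the example self-contained. A real pipeline would presumably load genuine pre-trained embeddings and train a proper classifier on the annotated sentences.

```python
import math

# Hypothetical 3-dimensional "pre-trained" vectors (assumption for illustration);
# a real pipeline would load actual pre-trained embeddings instead of this table.
TOY_VECTORS = {
    "dataset": (0.9, 0.1, 0.0),
    "data": (0.8, 0.2, 0.0),
    "records": (0.7, 0.2, 0.1),
    "collected": (0.6, 0.3, 0.1),
    "patients": (0.1, 0.8, 0.1),
    "symptoms": (0.0, 0.9, 0.1),
    "treatment": (0.1, 0.9, 0.0),
    "fever": (0.0, 0.8, 0.2),
}


def embed(sentence):
    """Average the vectors of known words; return None if no word is known."""
    words = [w.strip(".,;:()").lower() for w in sentence.split()]
    vecs = [TOY_VECTORS[w] for w in words if w in TOY_VECTORS]
    if not vecs:
        return None
    return tuple(sum(dim) / len(vecs) for dim in zip(*vecs))


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def train(labeled):
    """Compute one centroid per label from (sentence, label) pairs."""
    by_label = {}
    for sentence, label in labeled:
        vec = embed(sentence)
        if vec is not None:
            by_label.setdefault(label, []).append(vec)
    return {
        label: tuple(sum(dim) / len(vecs) for dim in zip(*vecs))
        for label, vecs in by_label.items()
    }


def classify(sentence, centroids):
    """Assign the label whose centroid is most similar to the sentence vector."""
    vec = embed(sentence)
    if vec is None:
        return "non-data"  # no known words: treat as non-data by default
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


# Hypothetical annotated sentences, mimicking the data / non-data labels.
TRAIN = [
    ("We collected records from a public dataset.", "data"),
    ("The data were collected daily.", "data"),
    ("Patients reported fever and other symptoms.", "non-data"),
    ("Treatment reduced symptoms in patients.", "non-data"),
]
CENTROIDS = train(TRAIN)
```

Calling `classify("Case records come from an open dataset.", CENTROIDS)` assigns the sentence to whichever class centroid it resembles most. Averaging word vectors ignores word order, which keeps the sketch short but is a simplification.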

Files

MACSS_Thesis_Yilun_XU_Final_Version.pdf

Files (737.4 kB)

md5:25045688d6ea3868d71f0462e256ed50
737.4 kB

Additional details

Identifiers

Other
oai:uchicago.tind.io:3491

UChicago Information

Division(s)
Social Sciences Division
Department(s)
Computational Social Sciences (MACSS)