About the Project


Our journey began with a fundamental question: How can we make digitized text collections, often riddled with publisher-generated material, more accessible and useful for researchers? In the age of digital libraries, large-scale scanned texts pose unique challenges, with much of the content cluttered by peritextual elements—advertisements, publisher introductions, and other non-core content. These added layers can obstruct scholars' ability to analyze primary material effectively. Our project aims to change how scholars interact with these collections by introducing automated methods that separate core text from auxiliary material, enabling precise and streamlined exploration of digitized works across disciplines.

Peritext example from a sample book.

Our research has focused on distinguishing between 'paratext' and 'core text,' a distinction rooted in literary theory. Paratext encompasses the elements surrounding the main work—tables of contents, critical reviews, bibliographies, and promotional material. By accurately identifying these boundaries, our project ensures that analyses focus on authorial content rather than on extraneous publisher-created material. We explore questions that are foundational for humanities research in the digital age: Which parts of a work should be highlighted, and which can be set aside? This filtering is more than a technical necessity; it lets scholars concentrate on primary content and improves the accuracy of text-based research.

Currently, digital libraries rarely distinguish between these layers, and the resulting noise can distort research. Imagine a literary study that mistakes a publisher's address in a bibliography for a location referenced in a novel, or a topic analysis skewed by phrases repeated across promotional sections. Our project tackles this problem head-on with a robust dataset and automated classification tools that analyze, parse, and structure text according to researcher needs. These tools not only filter out unrelated peritext material but also enable large-scale analysis with greater precision. The dataset itself comprises 1,000 carefully curated texts annotated to delineate core and peripheral content. This resource is fundamental to our work and promises to significantly advance research capabilities for individual scholars and larger institutions alike.
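To make the idea of automated classification concrete, the sketch below shows one way a page-level classifier could be trained on an annotated corpus like ours. It is illustrative only, not our actual pipeline: the file name, column names, labels, and model choice (TF-IDF features with logistic regression via scikit-learn) are assumptions for the example.

```python
# Illustrative sketch only: a minimal page-level classifier that labels
# pages as "core" or "paratext". The file path, column names, and model
# are hypothetical and do not reflect the project's actual pipeline.
import csv

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Assume a CSV with one row per page: the page text and a manual label.
pages, labels = [], []
with open("annotated_pages.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        pages.append(row["page_text"])
        labels.append(row["label"])  # "core" or "paratext"

X_train, X_test, y_train, y_test = train_test_split(
    pages, labels, test_size=0.2, random_state=42, stratify=labels
)

# Bag-of-words baseline: TF-IDF features feeding a logistic regression.
model = make_pipeline(
    TfidfVectorizer(max_features=20000, ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
```

A wording-only baseline like this is a useful point of comparison, though in practice paratext detection also depends on structural cues such as a page's position and layout, which is part of what makes the problem harder than simple frequency-based methods suggest.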

Peritext example from a sample book.

An environmental scan shows that this paratext problem is longstanding. From early digitization projects like Project Gutenberg to the expansive HathiTrust Digital Library, the challenge of differentiating core from non-core material has persisted. Existing methods—such as frequency analysis—often fall short on complex digitized documents where the boundary between paratext and core text is blurred. Our project builds on these approaches with machine learning, aiming for a refined, scalable solution that respects the structural nuances of each text and lets researchers explore works with a clarity and accuracy previously out of reach.

We are proud of the interdisciplinary team driving this project forward. Led by experts in digital humanities, library science, and computational linguistics, our team includes consultants and advisory board members from leading research institutions. Together, we are committed to fostering a new era of research capabilities in digital libraries. This collaboration also supports emerging scholars through hands-on roles, providing students with invaluable experience in text analysis, machine learning, and data structuring.

Looking to the future, we are developing an accessible visualization tool for digital libraries that lets users explore structured text with ease. Our final products will include a publicly available dataset, an open-source classification tool, and visualization software that highlights text structures within large digital collections. These resources promise to benefit anyone working in fields where accurate text analysis is essential, from literary studies and linguistics to the social sciences. We plan to publish our results and release our tools through open channels so that researchers worldwide can build upon and benefit from our work. Our mission is to pave the way for more insightful, scalable, and precise research across the digital humanities.