Mengyue Zhao, University of Oxford
In the evolving landscape of historical and social science research, questions about data reliability have become more prominent. This study pioneers an approach to bolstering trust in historical records through the application of deep learning and generative AI. How do we go through 15 million pages of U.S. Census records to correct 1-2 million "extra persons"? In partnership with the Minnesota Population Center, my research addresses the broader challenge of transcription errors in historical documents, which significantly undermine data accuracy. These errors, including common misreadings such as "No One" transcribed as "Noah" and "Vacant" as "Vincent," reflect deeper issues of trust and verification in the historical record. By deploying deep learning algorithms, we identify and correct these inaccuracies, achieving a 95% accuracy rate on a large test dataset. This approach aims to eliminate the 1-2 million erroneous "extra persons" from the analysis, significantly enhancing the dataset's reliability. Additionally, my use of generative AI to link complex individual names across varied documents tackles the challenges of inclusion and exclusion in historical narratives. This approach not only improves the precision of data analysis across historical records to 93.5% but also enables the identification and association of diverse name representations that may have been overlooked or misrepresented within traditional archival practices. This work underscores the transformative impact of Large Language Models and deep learning technologies in advancing social sciences and humanities research. By applying these technologies to historical document analysis, my study enhances the reliability and accuracy of historical records, contributing to discussions on identity, kinship, and community in the digital age.
This innovative effort has wide implications for scholars in the social sciences, setting new benchmarks for integrating advanced technology into historical research.
Presented in Session 194. Implications of New Techniques on Data Infrastructure