Sam Hwang, University of British Columbia
Christian Møller Dahl, University of Southern Denmark
Torben Johansen, University of Southern Denmark
Munir Squires, University of British Columbia
Digitized historical documents have become an increasingly important source of research data. These datasets have allowed economists to break ground on issues as diverse as immigration, intergenerational mobility, culture, and discrimination. However, empirical evidence indicates that these digitized datasets contain a high incidence of measurement error. For instance, Ghosh et al. (2023) found that 28 percent of the records in the 1940 U.S. Federal Census have transcription errors, and Hwang and Squires (2024) estimated that up to 47 percent of measurement errors in the U.S. Federal Censuses between 1850 and 1930 can be attributed to transcription errors. In this project, we correct these transcription errors by combining state-of-the-art OCR technology with two independently transcribed but error-laden full-count census datasets. Unlike LayoutParser (Shen et al., 2021), this OCR technology, developed by Dahl et al. (2022, 2023), is specifically designed for segmenting and transcribing large collections of dense tabular documents. It uses recent deep-learning techniques, including Segment Anything (Kirillov et al., 2023) and SegFormer (Xie et al., 2021), for table segmentation and cropping, as well as sequence-to-sequence (seq2seq) transformer-based architectures (Dahl et al., 2023) for table cell transcription. We will train our OCR model on the census manuscript images and on the records for which the two transcriptions are in agreement. Our error-corrected census data can significantly reduce biases in downstream analyses.
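The selection of training labels from records on which the two transcriptions agree can be illustrated with a minimal sketch. It is not the authors' pipeline: the cell key `(image_id, row, field)`, the `normalize` rule, and the sample values are all hypothetical and serve only to show the agreement filter.

```python
from typing import Dict, Tuple

# Hypothetical key: (image_id, row, field) identifies one table cell.
CellKey = Tuple[str, int, str]


def normalize(value: str) -> str:
    """Collapse whitespace and case so formatting differences are not counted as disagreement."""
    return " ".join(value.split()).casefold()


def agreed_labels(a: Dict[CellKey, str], b: Dict[CellKey, str]) -> Dict[CellKey, str]:
    """Return cells present in both transcriptions whose normalized, non-empty values match."""
    labels = {}
    for key in a.keys() & b.keys():
        va, vb = normalize(a[key]), normalize(b[key])
        if va and va == vb:
            labels[key] = va
    return labels


# Illustrative values only; these are not real census records.
transcription_a = {
    ("img_001", 1, "surname"): "Johansen",
    ("img_001", 1, "age"): "34",
    ("img_001", 2, "surname"): "Moller",
}
transcription_b = {
    ("img_001", 1, "surname"): "johansen ",
    ("img_001", 1, "age"): "84",
    ("img_001", 2, "surname"): "Moller",
}

# Cells on which both transcriptions agree become candidate training labels.
training_labels = agreed_labels(transcription_a, transcription_b)

# Cells on which they disagree are the ones a trained model would need to adjudicate.
disputed = sorted((transcription_a.keys() & transcription_b.keys()) - training_labels.keys())
```

In this sketch the agreed cells would be paired with their cropped cell images for training, while the disputed cells are left for the trained model to transcribe.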
No extended abstract or paper available
Presented in Session 122. Designing and Evaluating Data Pipelines