Matthew Sobek, University of Minnesota
Matt A. Nelson, University of Minnesota
The 1950 U.S. census manuscript images were released to the public in 2022. As with earlier U.S. censuses, IPUMS partnered with Ancestry.com to convert the full 152-million-person enumeration into a scientifically useful population database: cleaning, coding, and documenting the data and distributing it through the IPUMS web dissemination system. IPUMS released a preliminary public use dataset in early 2024, but much work remains. We welcome feedback from the user community as we address the issues described here.

The 1950 full count census database poses a variety of challenges. It is the largest dataset IPUMS has ever processed, with the difficulties one would expect from sheer scale: there is no prospect of hand-correcting errant cases, and even basic diagnostic tests are cumbersome. But the most complicated issues stem from the transcription method. Unlike all earlier U.S. census databases, which were keyed by hand, the 1950 data are largely the product of computerized optical character recognition (OCR) applied to the image files.

The data capture method and the state of the source material create many problems. The original data transmitted to IPUMS had over two million phantom records, mostly vacant households where words written on the form can be mistaken for a person record. Ancestry removed many such cases and IPUMS cut the number in half again, but nearly a million remain. Income data are uniquely challenging because office workers wrote values expressed in hundreds of dollars in the same field as the original entries, and the OCR process might record either number, with or without decimals. Institution information was written in the form header, where data capture was poor. And all variables exhibit more spelling variation and more strange character and numeric combinations than manual data entry would produce.
No extended abstract or paper available
Presented in Session 122. Designing and Evaluating Data Pipelines