Some Data Quality Checks in Canadian Census Microdata (1851-1921)

Laurent Richard, Université Laval
Marc St-Hilaire, Département de géographie, Université Laval
Richard Marcoux, Université Laval

As part of The Canadian Peoples / La population canadienne (TCP) project, teams at Université Laval and the University of Saskatchewan georeferenced each of the 40 million individual enumeration records for 1852-1921. Locating individual records precisely within the framework of census geography allows us to compare the microdata counts with the tables published in the official census volumes. The loss of some census schedules understandably leads to some undercounts in the microdata. Unexpectedly, there are also some overcounts, most of which seem to arise from returns describing people who had died or whose entries were crossed out by the enumerator for some other reason. Reconciling the counts made by census officers at the time with the counts based on TCP microdata requires extensive checking of the census page images (available on the Library and Archives Canada web site). In addition, we compare the TCP data with an independent transcription of the census records for Quebec City, 1851-1911, developed for the Population et histoire sociale de la ville de Québec (PHSVQ) project. Except for the 1881 dataset, the PHSVQ data were entered mainly by Université Laval students and by the Société de généalogie de Québec, both of which have extensive knowledge of French vocabulary and character strings, including those found in people's names, occupations, etc. Our comparison of first and last names relies on Jaro-Winkler and other string distance metrics. While the fit between the two datasets is broadly acceptable, the discrepancies that emerge from our comparison add a useful dimension to discussions of “Trust and Distrust of Historical Sources in the Digital Age”.
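For readers unfamiliar with the Jaro-Winkler measure mentioned above, the following is a minimal sketch of how two transcriptions of the same name might be scored. It is not the project's actual code: the function names, the accent-stripping `normalize` step, and the default prefix weight of 0.1 (capped at four characters) are assumptions made for illustration, and the sample names are hypothetical.

```python
import unicodedata


def normalize(name: str) -> str:
    """Assumed preprocessing: strip accents, lowercase, trim whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_marks = "".join(c for c in decomposed if not unicodedata.combining(c))
    return without_marks.casefold().strip()


def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]; 1.0 means identical strings."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are equal and close enough in position.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (
        matches / len1 + matches / len2 + (matches - transpositions) / matches
    ) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity boosted for strings sharing a common prefix."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1.0 - base)


# Hypothetical pair of transcriptions of one surname from two datasets.
score = jaro_winkler(normalize("Côté"), normalize("Cotte"))
print(f"{score:.3f}")
```

A score near 1 suggests that the two transcriptions probably refer to the same name. The cut-off above which a pair counts as a match is a judgment call that the abstract does not specify.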


 Presented in Session 21. Evaluating Data Quality II