Machine learning approaches achieve high accuracy for text recognition and are therefore increasingly used for the transcription of handwritten historical sources. However, using machine learning in production requires a streamlined end-to-end pipeline that scales to the dataset size and a model that achieves high accuracy with few manual transcriptions. The correctness of the model results must also be verified. This paper describes our lessons learned developing, tuning and using the Occode end-to-end machine learning pipeline for transcribing 2.3 million handwritten occupation codes from the Norwegian 1950 population census. We achieve an accuracy of 97% for the automatically transcribed codes, and we send 3% of the codes for manual veri...
Occupation coding, an important task in official statistics, refers to coding a respondent's text ans...
Transformer-based architectures show excellent results on the task of handwrit...
Occupational data are a common source of workplace exposure and socioeconomic information in epidemi...
Transcribing the 1950 Norwegian census with 3.3 million person records and linking it to the Central...
This article is part of the "Norwegian Historical Population Register" project financed by the Norwe...
This thesis aims to assist Arkivverket, The National Archival Services of Norway, in automating the ...
The increasing availability of digitised registration records presents a significant opportunity for...
Interdisciplinary collaboration between two faculty members in the humanities and computer science, ...
This article explains how two projects implement semi-automated transcription routines: for census s...
This collection of data contains ground-truth (gold standard) datasets for the employment status rec...
This training dataset includes a total of 34,913 manually transcribed text segments. It is dedicated...
Accepted to the 7th International Workshop on Historical Document Imaging and Processing (HIP 23) Int...
In this thesis, we investigate techniques for the automatic transcription of handwritten text in dig...