BlueGreen Labs was happy to co-author and support a research paper on Handwritten Text Recognition (HTR) of tabulated data, led by Bas Vercruysse (Vercruysse et al. 2025) and the Center for Digital Humanities at the University of Ghent. This work addresses the many issues in processing large volumes of historical tabulated data and proposes potential workflows. In particular, it shows that, given the sheer number of records, the variability of multi-author records (handwriting styles), compounded by the variability of layouts, means this work is not yet feasible in a fully automated fashion. A human in the loop is still required. The full paper is open access and can be found with our other materials on data recovery.
To address some of these workflow issues, BlueGreen Labs recently consolidated an older codebase to support this hybrid workflow, combining manual annotation and machine learning approaches. In particular, in the context of the COBECORE project, we have been building a workflow which relies on template matching, to sidestep layout detection issues, and a variety of machine learning approaches to classify numerical data records. Once data is sorted by format, an empty sheet (or one from which the content has been deleted) is used in combination with a GIMP plugin to outline the structure of a document. This outline marks the rows and columns to be transcribed, or select items in a header. GIMP was chosen as the image processing software because it is free and works across platforms. Once templates and outlines are annotated, we can automatically match similar records to this known, referenced template. This allows for the extraction of individual cells in a table (defined by a row and column location).
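The last step described above — cropping individual cells once a scanned record has been matched to the reference template — can be sketched as follows. This is a minimal illustration only: the outline format (lists of row and column pixel boundaries) and the function name are hypothetical, not the actual output of the GIMP plugin or the package's API.

```python
import numpy as np

def extract_cells(image, row_edges, col_edges):
    """Crop table cells from a scan already aligned to the template.

    `row_edges` and `col_edges` are the pixel boundaries of the table
    grid as annotated on the empty reference sheet (illustrative
    format). Returns a dict mapping (row, column) to the cell image.
    """
    cells = {}
    for i, (r0, r1) in enumerate(zip(row_edges[:-1], row_edges[1:])):
        for j, (c0, c1) in enumerate(zip(col_edges[:-1], col_edges[1:])):
            cells[(i, j)] = image[r0:r1, c0:c1]
    return cells

# Toy example: a 100 x 150 pixel "scan" holding a 2 x 3 grid of cells.
scan = np.zeros((100, 150), dtype=np.uint8)
cells = extract_cells(scan, row_edges=[0, 50, 100], col_edges=[0, 50, 100, 150])
```

Each cropped cell can then be passed individually to an HTR engine, which is what makes the template-based approach attractive: transcription operates on small, well-defined snippets rather than on a full, variable page layout.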
After template matching, the weaHTR python package can be used to transcribe data using a variety of common HTR frameworks, from Tesseract and a custom TrOCR implementation to PyLaia (Transkribus) and Kraken (eScriptorium). Custom docker files are provided to reconcile the different python requirements. Not all frameworks can run within a single docker image, but they can run spread across two images on the same system while retaining the same underlying configuration files and data requirements. In addition, small demo datasets are provided to test the workflow. We hope this small, ongoing contribution can help people further the transcription of historical data.
References
Vercruysse, B., Birkholz, J.M., Chandrasekar, K.K.T., et al. (2025). Human-in-the-loop tabular data extraction methods for historical climate data rescue. International Journal on Document Analysis and Recognition (IJDAR). https://doi.org/10.1007/s10032-025-00524-y