Human-in-the-loop tabular data extraction methods for historical climate data rescue

Abstract

Historical meteorological data is necessary for modeling climate scenarios, yet significant challenges remain in its archiving, extraction, organization, and provision. Data rescue efforts play an important role in collecting relevant information, but the real challenge lies in implementing accurate and efficient methods for extracting information from historical tabular records, most of which were written by hand in logbooks. The diverse tabular structures, layouts, and writing styles in these documents, along with the accessibility and varying preservation quality of the sources, make it impractical to rely on a single solution. To fill this gap, we propose a human-in-the-loop (HIL) workflow, with benchmarks of open- and closed-source methods for extracting and processing handwritten tabular data. Using climatological data from the Congo region (1907–1960) as a case study, we explain how HIL workflows can be implemented and what trade-offs they carry compared with relying solely on computationally steered models. In addition, we outline how generalized large vision-language models have changed the way HIL workflows can be implemented to increase accuracy and precision, not only through semi-automatic solutions for data provision but also through prompt engineering and the necessary (historical) context to achieve optimal results.

Publication
International Journal on Document Analysis and Recognition (2025)
history, climate data, data recovery, HTR, OCR
Koen Hufkens, PhD
Founder, Researcher

As an earth system scientist and ecologist I model ecosystem processes.
