
Proceedings Paper

Software workflow for the automatic tagging of medieval manuscript images (SWATI)
Author(s): Swati Chandna; Danah Tonne; Thomas Jejkal; Rainer Stotzka; Celia Krause; Philipp Vanscheidt; Hannah Busch; Ajinkya Prabhune

Paper Abstract

Digital methods, tools and algorithms are gaining in importance for the analysis of digitized manuscript collections in the arts and humanities. One example is the BMBF-funded research project “eCodicology”, which aims to design, evaluate and optimize algorithms for the automatic identification of macro- and micro-structural layout features of medieval manuscripts. The main goal of this research project is to give humanities scholars better insights into high-dimensional datasets of medieval manuscripts. The main challenges in designing a workflow for the arts and humanities are the heterogeneous nature and size of the humanities data and the need to create a database of automatically extracted, reproducible features for better statistical and visual analysis. This paper presents a concept of a workflow for the automatic tagging of medieval manuscripts. As a starting point, the workflow uses medieval manuscripts digitized within the scope of the project “Virtual Scriptorium St. Matthias”. Firstly, these digitized manuscripts are ingested into a data repository. Secondly, specific algorithms are adapted or designed for the identification of macro- and micro-structural layout elements such as page size, writing space and number of lines. Lastly, a statistical analysis and scientific evaluation of the manuscript groups are performed. The workflow is designed generically to process large amounts of data automatically with any desired algorithm for feature extraction. As a result, a database of objectified and reproducible features is created, which helps to analyze and visualize hidden relationships across around 170,000 pages. The workflow shows the potential of automatic image analysis by enabling the processing of a single page in less than a minute. Furthermore, accuracy tests of the workflow on a small set of manuscripts with respect to features like page size and text areas show that automatic and manual analysis are comparable.
The use of a computer cluster will allow high-performance processing of large amounts of data. The software framework itself will be integrated as a service into the DARIAH infrastructure to make it adaptable for a wider range of communities.

Paper Details

Date Published: 8 February 2015
PDF: 11 pages
Proc. SPIE 9402, Document Recognition and Retrieval XXII, 940206 (8 February 2015); doi: 10.1117/12.2076124
Author Affiliations
Swati Chandna, Karlsruher Institut für Technologie (Germany)
Danah Tonne, Karlsruher Institut für Technologie (Germany)
Thomas Jejkal, Karlsruher Institut für Technologie (Germany)
Rainer Stotzka, Karlsruher Institut für Technologie (Germany)
Celia Krause, Technische Univ. Darmstadt (Germany)
Philipp Vanscheidt, Univ. Trier (Germany)
Hannah Busch, Univ. Trier (Germany)
Ajinkya Prabhune, Karlsruher Institut für Technologie (Germany)


Published in SPIE Proceedings Vol. 9402:
Document Recognition and Retrieval XXII
Eric K. Ringger; Bart Lamiroy, Editor(s)

© SPIE