
Analysis of Historical Documents

IDDC working group session organized by Muriel Visani and Jean-Marc Ogier

28 June 2012

Handwriting Recognition in Historical Documents
Andreas Fischer, University of Fribourg, Switzerland

  • Abstract:

This talk addresses recent advances in pattern recognition methods for handwriting recognition in historical documents. The aim of these methods is to automatically extract textual content from digitized manuscript images. Based on their textual content, millions of historical manuscript images could be integrated in digital libraries, which would help to preserve our cultural heritage by making it readily accessible to researchers and the public.

Two state-of-the-art strategies are discussed to model and recognize characters, words, and sentences. The first is a generative strategy using hidden Markov models (HMMs); the second is a discriminative strategy using a special form of recurrent neural network (NN). The learning-based systems are generic in the sense that they can learn character appearance models for arbitrary alphabetical languages as long as a number of training samples are provided. They operate at the level of text lines, avoiding prior word and character segmentation, which is prone to errors for touching characters, broken characters, variable word spacing, and difficult image conditions stemming from, e.g., paper texture, damaged parchment, faded ink, and ink bleed-through.
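As a rough illustration of why line-level HMM decoding needs no prior character segmentation, the following minimal sketch runs Viterbi decoding over a toy sequence of frames. It is not the system presented in the talk: the two one-state character models, the discrete frame symbols "x" and "y", and all probabilities are hypothetical, whereas real systems typically use multi-state character models and continuous feature vectors.

```python
import math


def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return the most likely state path for a sequence of discrete observations."""
    # best[t][s]: log-probability of the best path that ends in state s at frame t
    best = [{s: log_start[s] + log_emit[s][observations[0]] for s in states}]
    back = []
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[-1][p] + log_trans[p][s])
            scores[s] = best[-1][prev] + log_trans[prev][s] + log_emit[s][obs]
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final state
    state = max(states, key=lambda s: best[-1][s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


def collapse(path):
    """Merge runs of identical states into (character, start_frame, end_frame)."""
    segments = []
    for t, s in enumerate(path):
        if segments and segments[-1][0] == s:
            segments[-1] = (s, segments[-1][1], t)
        else:
            segments.append((s, t, t))
    return segments


# Toy models (all values hypothetical): one state per character
states = ["a", "b"]
log = math.log
log_start = {"a": log(0.5), "b": log(0.5)}
log_trans = {"a": {"a": log(0.7), "b": log(0.3)},
             "b": {"a": log(0.3), "b": log(0.7)}}
log_emit = {"a": {"x": log(0.9), "y": log(0.1)},
            "b": {"x": log(0.1), "y": log(0.9)}}

frames = ["x", "x", "x", "y", "y"]  # stand-in for sliding-window features
path = viterbi(frames, states, log_start, log_trans, log_emit)
segments = collapse(path)
print(segments)
```

The character boundaries in `segments` fall out of the decoding itself rather than being fixed beforehand. Note that this simplification cannot represent doubled letters, which is one reason practical character models use several states each.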

Four subproblems of handwriting recognition in historical documents are addressed in this talk, namely ground truth creation, automatic transcription, keyword spotting, and transcription alignment. Experimental results are presented for several historical scripts and languages. The IAM historical document database (IAM-HistDB) includes Latin texts from the 9th century written in Carolingian minuscules (Saint Gall database), medieval German texts from the 13th century written in Gothic minuscules (Parzival database), and longhand English texts from the 18th century (George Washington database). The experimental results are promising in terms of accuracy, speed, and costs for indexing historical documents in digital libraries.

  • References:

[1] A. Fischer, V. Frinken, and H. Bunke. Application of hidden Markov models for handwriting recognition. To appear in Handbook of Statistics, volume 31. Elsevier, 2012.
[2] A. Fischer, A. Keller, V. Frinken, and H. Bunke. Lexicon-free handwritten word spotting using character HMMs. Pattern Recognition Letters, 33(7):934–942, 2012.
[3] V. Frinken, A. Fischer, R. Manmatha, and H. Bunke. A novel word spotting method based on recurrent neural networks. IEEE Trans. PAMI, 34(2):211–224, 2012.

Platforms for Document Image Processing
Rafael Dueire Lins, Federal University of Pernambuco, Brazil

  • Abstract:

This seminar provides a brief overview of the platforms for document image processing developed by the presenter and his research collaborators.

  • CV:

Rafael Dueire Lins holds a B.Sc. degree in Electrical Engineering (Electronics) from the Federal University of Pernambuco, Brazil (1982) and a Ph.D. degree in Computing from the University of Kent at Canterbury, UK (1986). Lins has published 10 books, among them the best-seller "Garbage Collection: Algorithms for Dynamic Memory Management" (John Wiley & Sons, UK, 1996), which was translated into Chinese (Mandarin) and published by ChinaPub in 2004. His pioneering contributions include the creation of the Lambda-Calculus with explicit substitutions and the first general and efficient solution to cyclic reference counting in sequential, parallel and distributed architectures. Lins was one of the pioneering researchers in document engineering and digital libraries in Latin America. In this area, he was the first to address the problem of back-to-front interference (bleeding) in documents, in 1993. Lins has supervised 44 M.Sc. dissertations and 7 Ph.D. theses in computer science and electrical engineering. He has published 32 papers in refereed journals and over one hundred articles in international conferences.

Published on Friday, 29 June 2012