Training School: “Optical Character Recognition and Text Encoding for the production of ELTeC”

This Training School covers Optical Character Recognition (OCR) and text encoding for the production of contributions to the ELTeC (European Literary Text Collection). The ELTeC will offer a principled sample of European literary production: the full text of novels published between 1850 and 1920 in each of many European languages.

The aim of the Training School is to enable a group of Action participants to go from a novel in the form of a scanned book to a TEI-encoded full-text version of the novel. The target audience is researchers from participating countries who are interested in contributing texts to the ELTeC, who need to digitize texts for this purpose, and who are not yet sufficiently familiar with the practicalities of Optical Character Recognition for full-text generation or with the fundamentals of encoding texts according to the Guidelines of the Text Encoding Initiative.

All participants are expected to attend both days of the Training School.

Key information

  • Dates: Monday, April 16, to Tuesday, April 17, 2018 (all day on both days)
  • Location: University of Würzburg, Germany
  • Local organizers: Leonard Konle (leonard.konle@uni-wuerzburg.de) and Fotis Jannidis
  • Contact persons: Carolin Odebrecht (carolin.odebrecht@hu-berlin.de), Christof Schöch (schoech@uni-trier.de)
  • Trainers: Christian Reul, Leonard Konle, Lou Burnard
  • Background information: https://github.com/distantreading/WG1/wiki

Programme outline

Monday, April 16, 2018

Location: Philosophische Fakultät, Room 6.E.8, Google Maps: https://goo.gl/maps/Lx7B3dRRpMs

  • 09:15 Welcome to all participants
  • 09:30 OCR basics
  • 10:30 Hands-on OCR with Abbyy FineReader
  • 12:00 Lunch
  • 13:00 Demo of OCROPUS
  • 15:30 Anna Řehořková, “Digitization practice in the Czech National Corpus” (talk)
  • 15:45 Coffee, cookies and questions
  • 19:30 Dinner at “Alter Kranen”
    Address: Kranenkai 1, https://goo.gl/maps/iHyVPQCTQPE2

Tuesday, April 17, 2018

Location: Philosophische Fakultät, Room 3.E.3, Google Maps: https://goo.gl/maps/Lx7B3dRRpMs

  • 09:30 An introduction to the Text Encoding Initiative and to the ELTeC encoding principles
  • 12:00 Lunch
  • 13:00 Practical work on converting texts from OCR output to XML-TEI and on encoding texts for the ELTeC
  • For details, see: https://distantreading.github.io/Training/programme.html