We are happy to announce that a new version of the European Literary Text Collection (ELTeC) has recently been released! This new release, version 1.1.0, contains several new collections, many additional novels in existing collections, three collections with versions that feature linguistic annotation, and many other improvements. ELTeC remains a work in progress, so we will continue to expand and improve it.
The latest release contains 14 different corpora with at least 50 novels each and a total of more than 1,200 novels. Among these corpora, 8 are complete, containing 100 novels each: Czech, German, English, French, Hungarian, Polish, Portuguese and Slovenian. Another 6 corpora contain at least 50 novels, sometimes considerably more: Norwegian, Romanian, Serbian, Spanish, Swedish and Ukrainian. Finally, 3 of the complete corpora also provide versions with linguistic annotation: Hungarian, Portuguese and Slovenian.
ELTeC collections are curated on GitHub, where you have fine-grained access to the texts. Releases of ELTeC corpora are also published on Zenodo, which ensures long-term availability and findability. The following image shows the state of all collections in April 2021.
If you’re using ELTeC in your teaching or research, please remember to acknowledge the contributors’ work by using the following citation suggestion for ELTeC as a whole (or one of the corpus-specific citation suggestions you find on GitHub and Zenodo): European Literary Text Collection (ELTeC), version 1.1.0, April 2021, edited by Carolin Odebrecht, Lou Burnard and Christof Schöch. COST Action Distant Reading for European Literary History (CA16204). DOI: doi.org/10.5281/zenodo.4662444.
Background information: Ultimately, the ELTeC core will contain at least 10 linguistically annotated subcollections in at least 10 different European languages, each comprising 100 novels and comparable to the others in internal structure, totalling at least 1,000 full-text novels. The extended ELTeC will take the total number of full-text novels to at least 2,500. Novels have been chosen among major literary genres for availability and size. Chronological limits are due to constraints related to copyright and availability of quality full texts.
For more information on ELTeC, see the following places:
- An overview of the current state in ELTeC corpus building can be found here: https://distantreading.github.io/ELTeC/
- Work on the different ELTeC collections is in progress here: https://github.com/COST-ELTeC
- A collection of relevant documentation can be found here: https://distantreading.github.io/
- ELTeC page on Zenodo, with archived releases: https://zenodo.org/communities/eltec/