Distant Reading Compendium

Welcome to the Distant Reading Compendium, edited by X, Y and Z. This virtual edited volume unites contributions that have emerged from the COST Action Distant Reading for European Literary History.

Front matter

Introduction to the volume

This introduction presents the COST Action Distant Reading for European Literary History and its key output, the European Literary Text Collection (ELTeC). It structures and summarizes the key findings reported in the publications that make up this virtual edited volume, all of which were created by Action members in the period 2018-2022.

COST, Distant Reading, ELTeC

URL:
DOI:
PDF:

@article{schoech_introduction_2022,
     title = {Introduction},
     doi = {https://doi.org/xyz},
     language = {eng},
     journal = {Zenodo},
     author = {Schöch, Christof and Eder, Maciej and Odebrecht, Carolin and Byszuk, Joanna and Arias, Rosario and Mihurko Poniž, Katja},
 }

Section 1: Building ELTeC

Chapter 1.1: In Search of Comity: TEI for Distant Reading

Authors: Lou Burnard, Christof Schöch, Carolin Odebrecht.

Any expansion of the TEI beyond its traditional user base involves a recognition that there are many differing answers to the traditional question “What is text, really?” We report on some work carried out in the context of the COST Action Distant Reading for European Literary History (CA16204), in particular on the TEI-conformant schemas developed for one of its principal deliverables: the European Literary Text Collection (ELTeC). – The ELTeC will contain comparable corpora for each of at least a dozen European languages, each being a balanced sample of one hundred novels from the period 1840 to 1920, together with metadata concerning their production and reception. We hope that it will become a reliable basis for comparative work in data-driven textual analytics. – The focus of the ELTeC encoding scheme is not to represent texts in all their original complexity, nor to duplicate the work of scholarly editors. Instead, we aim to facilitate a richer and better-informed distant reading than a transcription of lexical content alone would permit. At the same time, where the TEI encourages diversity, we enforce consistency by permitting representation of only a specific and quite small set of textual features, both structural and analytical. These constraints are expressed by a master TEI ODD, from which we derive three different schemas by ODD chaining, each associated with appropriate documentation.

distant reading, ELTeC, ODD chaining, corpus design, the European novel, literary studies

URL: https://journals.openedition.org/jtei/3500
DOI: https://doi.org/10.4000/jtei.3500
PDF:

Reference: Burnard, Lou, Christof Schöch, and Carolin Odebrecht, ‘In Search of Comity: TEI for Distant Reading’, Journal of the Text Encoding Initiative 14, 2021. DOI: https://doi.org/10.4000/jtei.3500.

@article{burnard_search_nodate,
     title = {In {Search} of {Comity}: {TEI} for {Distant} {Reading}},
     doi = {10.4000/jtei.3500},
     note = {Preprint: https://doi.org/10.5281/zenodo.3552488},
     language = {eng},
     number = {14},
     year = {2021},
     journal = {Journal of the Text Encoding Initiative},
     author = {Burnard, Lou and Schöch, Christof and Odebrecht, Carolin},
 }

Chapter 1.2: Creating the European Literary Text Collection (ELTeC): Challenges and Perspectives

Authors: Christof Schöch, Roxana Patraș, Tomaž Erjavec, Diana Santos.

The aim of this contribution is to reflect on the process of building the multilingual European Literary Text Collection (ELTeC) that is being created in the framework of the networking project Distant Reading for European Literary History funded by COST (European Cooperation in Science and Technology). To provide some background, we briefly introduce the basic idea of ELTeC with a focus on the overall goals and intended usage scenarios. We then describe the collection composition principles that we have derived from the usage scenarios. In our discussion of the corpus-building process, we focus on collections of novels from four different literary traditions as components of ELTeC: French, Portuguese, Romanian, and Slovenian, selected from the more than twenty collections that are currently in preparation. For each collection, we describe some of the challenges we have encountered and the solutions developed while building ELTeC. In each case, the literary tradition, the history of the language, the current state of digitization of cultural heritage, the resources available locally, and the scholars’ training level with regard to digitization and corpus building have been vastly different. How can we, in this context, hope to build comparable collections of novels that can usefully be integrated into a multilingual resource such as ELTeC and used in Distant Reading research? Based on our individual and collective experience with contributing to ELTeC, we end this contribution with some lessons learned regarding collaborative, multilingual corpus building.

Keyword1, Keyword2, Keyword3

DOI: https://doi.org/10.3828/mlo.v0i0.364
PDF: https://www.modernlanguagesopen.org/articles/10.3828/mlo.v0i0.364/galley/497/download/

Reference: Schöch, Christof, Roxana Patras, Tomaž Erjavec, and Diana Santos, ‘Creating the European Literary Text Collection (ELTeC): Challenges and Perspectives’, Modern Languages Open, 1 (2021), 25. DOI: https://doi.org/10.3828/mlo.v0i0.364.

@article{schoch_creating_2021,
     title = {Creating the {European} {Literary} {Text} {Collection} ({ELTeC}): {Challenges} and {Perspectives}},
     issn = {2052-5397},
     shorttitle = {Creating the {European} {Literary} {Text} {Collection} ({ELTeC})},
     url = {http://www.modernlanguagesopen.org/articles/10.3828/mlo.v0i0.364/},
     doi = {10.3828/mlo.v0i0.364},
     language = {en},
     number = {1},
     urldate = {2021-12-17},
     journal = {Modern Languages Open},
     author = {Schöch, Christof and Patras, Roxana and Erjavec, Tomaž and Santos, Diana},
     month = dec,
     year = {2021},
     pages = {25},
 }

Chapter 1.3: The Serbian Part of the ELTeC – from the Empty List to the 100 Novels Collection

Authors: Aleksandra Trtovac, Vasilije Milnović, and Cvetana Krstev.

In this paper we present challenges and solutions in preparing the Serbian part of the ELTeC collection, which contains 100 novels written and first published between 1840 and 1920. In the absence of a systematic digital library of Serbian literature, this was done from scratch: first, it was necessary to find out which novels existed and could be used; then they had to be retrieved, scanned, corrected and annotated. All this was achieved thanks to the enormous efforts of an army of devoted volunteer researchers. We analyze the results of these efforts and how they fit the Action’s anticipated outcome.

ELTeC, Serbian, Corpus Building

URL: http://infoteka.bg.ac.rs/ojs/index.php/Infoteka/article/view/2021.21.2.1_en
DOI: https://doi.org/10.18485/infotheca.2021.21.2.1

Reference: Trtovac, Aleksandra, Vasilije Milnović, and Cvetana Krstev, ‘The Serbian Part of the ELTeC – from the Empty List to the 100 Novels Collection’, Infotheca – Journal of Digital Humanities, 21.2 (2021), 7–25. DOI: https://doi.org/10.18485/infotheca.2021.21.2.1.

@article{trtovac_serbian_2021,
     title = {The {Serbian} {Part} of the {ELTeC} – from the {Empty} {List} to the 100 {Novels} {Collection}},
     volume = {21},
     issn = {1450-9687, 2217-9461},
     url = {http://infoteka.bg.ac.rs/ojs/index.php/Infoteka/article/view/2021.21.2.1_en},
     doi = {10.18485/infotheca.2021.21.2.1},
     number = {2},
     urldate = {2022-02-17},
     journal = {Infotheca},
     author = {Trtovac, Aleksandra and Milnović, Vasilije and Krstev, Cvetana},
     year = {2021},
     keywords = {type_publication},
     pages = {7--25},
 }

Section 2: Annotating ELTeC

Chapter 2.1: Annotation of the Serbian ELTeC Collection

Authors: Ranka Stanković, Cvetana Krstev, Branislava Šandrih Todorović, and Mihailo Škorić.

This paper presents the so-called level-2 edition of the SrpELTeC collection, developed within the activities of Working Group 2 – Methods and Tools of the COST Action CA16204 (Distant Reading for European Literary History), and its schema specification. The level-2 edition is a follow-up to the level-1 edition, which is used as input for morphosyntactic and NER annotation of novels. The Serbian level-2 pipeline outlines the steps required to produce level-2, including the methods and tools used in the process. Some statistics drawn from the Serbian ELTeC level-2 sub-collection bring interesting insights into the collection's content.

URL: https://infoteka.bg.ac.rs/ojs/index.php/Infoteka/article/view/2021.21.2.3_en
PDF: https://infoteka.bg.ac.rs/ojs/index.php/Infoteka/article/view/2021.21.2.3_en/259
DOI: https://doi.org/10.18485/infotheca.2021.21.2.3

Reference: Stanković, Ranka, Cvetana Krstev, Branislava Šandrih Todorović, and Mihailo Škorić, ‘Annotation of the Serbian ELTeC Collection’, Infotheca – Journal of Digital Humanities, 21.2 (2021), 43–59. DOI: https://doi.org/10.18485/infotheca.2021.21.2.3.

@article{stankovic_annotation_2021,
     title = {Annotation of the {Serbian} {ELTeC} {Collection}},
     volume = {21},
     issn = {1450-9687, 2217-9461},
     url = {http://infoteka.bg.ac.rs/ojs/index.php/Infoteka/article/view/2021.21.2.3_en},
     doi = {10.18485/infotheca.2021.21.2.3},
     number = {2},
     urldate = {2022-02-17},
     journal = {Infotheca},
     author = {Stanković, Ranka and Krstev, Cvetana and Šandrih Todorović, Branislava and Škorić, Mihailo},
     year = {2021},
     keywords = {type_publication},
     pages = {43--59},
 }

Section 3: Analysing ELTeC

Chapter 3.1: Thresholds to the ‘Great Unread’: Titling Practices in Eleven ELTeC Collections

Authors: Roxana Patras, Carolin Odebrecht, Ioana Galleron, Rosario Arias, Berenike J. Herrmann, Cvetana Krstev, Katja Mihurko Poniž, Dmytro Yesypenko.

The main aim of the paper is to describe and, to a certain extent, to understand titling practices in literary discourse through the exploration of a multilingual literary corpus comprising European novels published between 1840 and 1920. The study is based on the analysis of 11 out of the 16 sub-collections of novels in preparation within the COST Action CA16204 “Distant Reading for European Literary History”, namely the English, French, German, Italian, Polish, Portuguese, Romanian, Serbian, Slovenian, Spanish, and Ukrainian sub-collections. We focus on an analysis of persons, places and genre entities in titles, and observe some regularities involving the “syntax” of these various entities.

ELTeC, novel titles

URL: http://interferenceslitteraires.be/index.php/illi/article/view/1102

Reference: Patras, Roxana, Carolin Odebrecht, Ioana Galleron, Rosario Arias, Berenike J. Herrmann, Cvetana Krstev, Katja Mihurko Poniž, and Dmytro Yesypenko, ‘Thresholds to the “Great Unread”: Titling Practices in Eleven ELTeC Collections’, Interférences Littéraires/Literaire Interferenties, 25 (2021), 163–87. URL: http://interferenceslitteraires.be/index.php/illi/article/view/1102.

@article{patras_thresholds_2021,
     title = {Thresholds to the “{Great} {Unread}”: {Titling} {Practices} in {Eleven} {ELTeC} {Collections}},
     volume = {25},
     issn = {2031-2970},
     shorttitle = {Thresholds to the “{Great} {Unread}”},
     url = {http://interferenceslitteraires.be/index.php/illi/article/view/1102},
     language = {en},
     urldate = {2021-10-27},
     journal = {Interférences littéraires/Literaire interferenties},
     author = {Patras, Roxana and Odebrecht, Carolin and Galleron, Ioana and Arias, Rosario and Herrmann, Berenike J. and Krstev, Cvetana and Mihurko Poniž, Katja and Yesypenko, Dmytro},
     month = oct,
     year = {2021},
     pages = {163--187},
 }

Chapter 3.2: Detecting Direct Speech in Multilingual Collection of 19th Century Novels

Authors: Joanna Byszuk, Michał Woźniak, Mike Kestemont, Albert Leśniak, Wojciech Łukasik, Artjoms Šeļa.

Fictional prose can be broadly divided into narrative and discursive forms, with direct speech being central to any discourse representation (alongside indirect reported speech and free indirect discourse). This distinction is crucial in digital literary studies and enables interesting forms of narratological or stylistic analysis. The difficulty of automatically detecting direct speech, however, is currently underestimated. Rule-based systems that work reasonably well for modern languages struggle with (the lack of) typographical conventions in 19th-century literature. While machine learning approaches to sequence modeling can be applied to solve the task, they typically face a severe skewness in the availability of training material, especially for lesser-resourced languages. In this paper, we report the results of a multilingual approach to direct speech detection in a diverse corpus of 19th-century fiction in 9 European languages. The proposed method fine-tunes a transformer architecture with a multilingual sentence embedder on a minimal amount of annotated training data in each language, and improves performance across languages with ambiguous direct speech marking, in comparison to a carefully constructed regular expression baseline.
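
To make the idea of a rule-based baseline concrete, here is a minimal Python sketch of paragraph-level direct speech detection with regular expressions. It is not the baseline used in the paper: the quotation mark pairs, the dash convention, and the function name `has_direct_speech` are all assumptions chosen for illustration, and real 19th-century typography is far less regular.

```python
import re

# Assumed typographic conventions (hypothetical, not taken from the paper):
# paired quotation marks in several styles, or a paragraph opening with a dash.
QUOTED = re.compile(r'“[^”]*”|„[^“”]*[“”]|«[^»]*»|"[^"]*"')
DASH_OPENING = re.compile(r'^\s*[—–-]\s*\S')


def has_direct_speech(paragraph: str) -> bool:
    """Return True if the paragraph matches one of the assumed speech markers."""
    return bool(QUOTED.search(paragraph) or DASH_OPENING.match(paragraph))


paragraphs = [
    "“Come in,” she said.",
    "— Entrez, dit-elle.",
    "„Komm herein“, sagte sie.",
    "She opened the door and waited.",
]
flags = [has_direct_speech(p) for p in paragraphs]
```

A rule set like this only works where the markers are used consistently, which is the limitation the abstract points to: a novel that prints dialogue without any marking produces no match at all.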

ELTeC, narrative, direct speech, deep learning

URL: https://aclanthology.org/2020.lt4hala-1.15/
PDF:
DOI:

Reference: Byszuk, Joanna, Michał Woźniak, Mike Kestemont, Albert Leśniak, Wojciech Łukasik, Artjoms Šeļa, and others, ‘Detecting Direct Speech in Multilingual Collection of 19th Century Novels’, in Proceedings of LT4HALA 2020 - 1st Workshop on Language Technologies for Historical and Ancient Languages, ed. by Rachele Sprugnoli and Marco Passarotti (presented at the LT4HALA 2020: 1st Workshop on Language Technologies for Historical and Ancient Languages, Paris: European Language Resources Association (ELRA), 2020), pp. 100–104. URL: https://lrec2020.lrec-conf.org/media/proceedings/Workshops/Books/LT4HALAbook.pdf.

Section 4: Beyond ELTeC, Beyond CA16204

Chapter 4.1: The Splendors and Mist(Eries) of Romanian Digital Literary Studies

Authors: Roxana Patras, Ioana Galleron, Camelia Grădinaru, Ioana Lionte, and Lucreţia Pascaru.

The present article is a snapshot of Digital Literary Studies (DLS) in present-day Romanian academia, higher education curricula, and research evaluation. In the first part, the emphasis falls on the term “digital turn” and on its specific uses and extensions in the humanities, as DH (digital humanities), on the one hand, and as digital literary studies/computer literary studies (DLS/CLS)/computational linguistics (CL), on the other. In the second part, we zoom in on the field of DLS/CLS and analyze the way in which it has been localized, operationalized, institutionalized and understood in the Romanian academic environment and publications (DH-targeted journals, humanities journals, and cultural magazines), in higher education curricula (master/bachelor programs of study), and in designing evaluation standards for DH/DLS/CLS research projects (methodologies for funding national research). In the third part, we provide a down-to-earth approach to Romanian DLS by bringing out the experience with digitization, format conversion, manual cleaning, encoding, annotation, and with various editing, quantitative analysis, and data management tools (AntConc, TXM, StyloR, Nooj, Heurist, Transkribus, Oxygen etc.), acquired throughout the implementation of the Hai-Ro Project (Hajduk Novels in Romania during the Long Nineteenth Century: digital edition and corpus analysis assisted by computational tools).

Distant Reading, Digital Literary Studies, Digital Humanities, Romania

URL: http://hermeneia.ro/archive/nr-23-2019-topic-interdisciplinarity-an-umbrella-term/
PDF: http://hermeneia.ro/wp-content/uploads/2019/11/18_Patras-et-al.pdf

Reference: Patras, Roxana, Ioana Galleron, Camelia Grădinaru, Ioana Lionte, and Lucreţia Pascaru, ‘The Splendors and Mist(Eries) of Romanian Digital Literary Studies: A State-of-the-Art Just before Horizons 2020 Closes Off’, Hermeneia, 23 (2019), 207–22. URL: http://hermeneia.ro/wp-content/uploads/2019/11/18_Patras-et-al.pdf.

@article{patras_splendors_2019,
     title = {The {Splendors} and {Mist}(eries) of {Romanian} {Digital} {Literary} {Studies}: a {State}-of-the-{Art} just before {Horizons} 2020 closes off},
     volume = {23},
     url = {http://hermeneia.ro/wp-content/uploads/2019/11/18_Patras-et-al.pdf},
     journal = {Hermeneia},
     author = {Patras, Roxana and Galleron, Ioana and Grădinaru, Camelia and Lionte, Ioana and Pascaru, Lucreţia},
     year = {2019},
     keywords = {type_publication},
     pages = {207--222},
 }

Chapter 4.2: Stylometry in a Bilingual Setup

Authors: Silvie Cinková and Jan Rybicki.

The method of stylometry by most frequent words does not allow direct comparison of original texts and their translations, i.e. across languages. For instance, in a bilingual Czech-German text collection containing parallel texts (originals and translations in both directions, along with Czech and German translations from other languages), authors would not cluster across languages, since the word frequency lists of any two Czech texts are obviously going to be more similar to each other than to that of a German text, and vice versa. We have tried to come up with an interlingua that would remove the language-specific features and possibly keep the language-independent features of the individual author signal, if they exist. We have tagged, lemmatized, and parsed each language counterpart with the corresponding language model in UDPipe, which provides a linguistic markup that is cross-lingual to a significant extent. We stripped the output of language-dependent items, but that alone did not help much. As a next step, we transformed the lemmas of both language counterparts into shared pseudolemmas based on a very crude Czech-German glossary, with a 95.6% success rate. We show that, for stylometric methods based on the most frequent words, we can do without translations.
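
The pseudolemma step can be illustrated with a small Python sketch. It is only a toy illustration under stated assumptions, not the authors' pipeline: the glossary entries, the function names, and the use of a plain Manhattan distance over relative frequencies (rather than any particular stylometric measure) are all hypothetical, and the input lists stand in for lemmas that a tool such as UDPipe would produce.

```python
from collections import Counter

# Hypothetical toy glossary: Czech and German lemmas that translate each
# other are mapped to one shared pseudolemma.
GLOSSARY = {
    "a": "AND", "und": "AND",
    "být": "BE", "sein": "BE",
    "v": "IN", "in": "IN",
    "ten": "THE", "der": "THE",
}


def to_pseudolemmas(lemmas):
    """Map lemmas to pseudolemmas; lemmas missing from the glossary are dropped."""
    return [GLOSSARY[lemma] for lemma in lemmas if lemma in GLOSSARY]


def relative_frequencies(tokens):
    """Relative frequency of each token in a list (empty dict for empty input)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {tok: n / total for tok, n in counts.items()} if total else {}


def manhattan_distance(freqs_a, freqs_b, features):
    """Sum of absolute frequency differences over the chosen features."""
    return sum(abs(freqs_a.get(f, 0.0) - freqs_b.get(f, 0.0)) for f in features)


# Pre-lemmatized toy input, one list per language counterpart.
czech_lemmas = ["ten", "být", "v", "a", "ten", "kniha"]
german_lemmas = ["der", "sein", "in", "und", "der", "Buch"]

cz = relative_frequencies(to_pseudolemmas(czech_lemmas))
de = relative_frequencies(to_pseudolemmas(german_lemmas))
features = sorted(set(cz) | set(de))
distance = manhattan_distance(cz, de, features)
```

Once both texts are expressed in the same pseudolemma vocabulary, their frequency profiles become directly comparable, which is what a frequency-based comparison across languages requires.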

Stylometry, Authorship Attribution, Czech, German

URL: https://aclanthology.org/2020.lrec-1.123/
PDF: https://aclanthology.org/2020.lrec-1.123.pdf
DOI:

Reference: Cinková, Silvie, and Jan Rybicki, ‘Stylometry in a Bilingual Setup’, in Proceedings of The 12th Language Resources and Evaluation Conference, LREC 2020, Marseille, France, May 11-16, 2020, ed. by Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, and others (presented at the 12th Language Resources and Evaluation Conference, LREC 2020, Marseille: European Language Resources Association, 2020), pp. 977–984. URL: https://www.aclweb.org/anthology/2020.lrec-1.123/.

@inproceedings{cinkova_stylometry_2020,
     address = {Marseille},
     title = {Stylometry in a {Bilingual} {Setup}},
     url = {https://www.aclweb.org/anthology/2020.lrec-1.123/},
     booktitle = {Proceedings of {The} 12th {Language} {Resources} and {Evaluation} {Conference}, {LREC} 2020, {Marseille}, {France}, {May} 11-16, 2020},
     publisher = {European Language Resources Association},
     author = {Cinková, Silvie and Rybicki, Jan},
     editor = {Calzolari, Nicoletta and Béchet, Frédéric and Blache, Philippe and Choukri, Khalid and Cieri, Christopher and Declerck, Thierry and Goggi, Sara and Isahara, Hitoshi and Maegaard, Bente and Mariani, Joseph and Mazo, Hélène and Moreno, Asunción and Odijk, Jan and Piperidis, Stelios},
     year = {2020},
     keywords = {type_publication},
     pages = {977--984},
 }