This is the programme for the closing conference of the COST Action Distant Reading for European Literary History. Please also see the overview page for the event with the complete meeting schedule including project-internal meetings.

Registration / Access

Thursday, April 21, 2022

14:00-15:30: Session 1, “Creating and Annotating ELTeC”

Opening / Words of welcome

(1) What a difference five years make: achievements and challenges of Distant Reading for European Literary History
(Christof Schöch and Maciej Eder)

(2) Collaborative creation of a multi-lingual literary corpus. Challenges and best practices for corpus design
Lou Burnard, Borja Navarro-Colorado, Carolin Odebrecht, Martina Scholger

(3) Mapping the inner life of characters in the European novel between 1840 and 1920
Tamara Radak, Lou Burnard, Pieter Francois, Fotis Jannidis, Diana Santos

Session chair: Christof Schöch

What a difference five years make: achievements and challenges of Distant Reading for European Literary History

Authors

  • Christof Schöch (1)
  • Maciej Eder (2)

Institutions

  1. University of Trier / Trier Center for Digital Humanities, Germany
  2. Institute of Polish Language, Polish Academy of Sciences, Poland

Keywords

  • COST Action, Distant Reading, achievements, challenges, deliverables, impact

Abstract

This contribution will face the impossible task of summarizing the key achievements, tangible and less tangible outputs, existing and expected impacts, and the many big and small challenges that the members of the COST Action “Distant Reading for European Literary History” have been involved with since everything started at our initial Management Committee Meeting in Brussels back in November 2017.

References

Collaborative creation of a multi-lingual literary corpus. Challenges and best practices for corpus design

Authors

  • Lou Burnard (1)
  • Borja Navarro-Colorado (2)
  • Carolin Odebrecht (3)
  • Martina Scholger (4)

Institutions

  1. Independent researcher, UK/France
  2. Universidad de Alicante, Spain
  3. Humboldt-Universität zu Berlin, Germany
  4. University of Graz, Austria

Abstract

Any project which attempts to design a corpus in the same way as an architect might design a building risks overlooking factors such as the socio-historical background, unforeseen features of its components, or unpredictable requirements of its users. We faced all these challenges during the creation of the European Literary Text Collection (ELTeC) for the COST Action Distant Reading. The project plan required that ELTeC should contain comparable corpora of at least a dozen European languages, each corpus being a balanced selection of one hundred novels from the period 1840 to 1920. We identify two main challenges. Firstly, what exactly is the universe of candidate items to be sampled in the construction of the corpus? What do we mean by “novels from the period 1840 to 1920” and what are the consequences of a chronological constraint which may be entirely culture-specific? Secondly, on what principles are texts chosen for inclusion? In our case we had agreed on four relatively objective “balance criteria” (date of publication, size, gender of author, and “canonicity”) and hoped to achieve a balanced frequency distribution for each of these criteria, so that (for example) there would be good coverage of each time slot and a good mix of text sizes, while also ensuring that the sample set was not dominated by any specific author gender or canonicity-value. This same concern to avoid bias led us to impose an additional constraint that no more than three works should be selected for any given author.
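The balance checks described above can be illustrated with a short sketch. It is not the project's own tooling and it is not the E5C metric: the field names, the category labels ("short", "low", and so on) and the four time-slot boundaries are assumptions made for illustration. Only the four criteria and the limit of three works per author come from the abstract.

```python
from collections import Counter

# Assumed time slots covering 1840-1920; the abstract only mentions "time slots".
SLOTS = ["1840-1859", "1860-1879", "1880-1899", "1900-1920"]
MAX_WORKS_PER_AUTHOR = 3  # constraint stated in the abstract


def time_slot(year):
    """Map a publication year to one of the assumed time slots."""
    if not 1840 <= year <= 1920:
        raise ValueError(f"year {year} is outside the period 1840-1920")
    # 1920 would start a fifth slot, so it is folded into the last one.
    return SLOTS[min((year - 1840) // 20, len(SLOTS) - 1)]


def balance_report(records):
    """Frequency distribution of each of the four balance criteria."""
    return {
        "time_slot": Counter(time_slot(r["year"]) for r in records),
        "size": Counter(r["size"] for r in records),
        "gender": Counter(r["gender"] for r in records),
        "canonicity": Counter(r["canonicity"] for r in records),
    }


def authors_over_cap(records, cap=MAX_WORKS_PER_AUTHOR):
    """Authors represented by more works than the cap allows."""
    counts = Counter(r["author"] for r in records)
    return sorted(author for author, n in counts.items() if n > cap)


# Hypothetical metadata records, invented for this example.
sample = [
    {"author": "Author A", "year": 1845, "size": "short", "gender": "F", "canonicity": "low"},
    {"author": "Author A", "year": 1862, "size": "medium", "gender": "F", "canonicity": "low"},
    {"author": "Author A", "year": 1881, "size": "long", "gender": "F", "canonicity": "high"},
    {"author": "Author A", "year": 1903, "size": "short", "gender": "F", "canonicity": "low"},
    {"author": "Author B", "year": 1920, "size": "long", "gender": "M", "canonicity": "high"},
]

print(balance_report(sample)["time_slot"])
print(authors_over_cap(sample))
```

A report of this kind only makes imbalances visible; deciding which texts to move or add remains an editorial judgement.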

We made no attempt at a formal definition of “the novel”, simply considering it to be any continuous prose fiction of suitable date, length and authorship, excluding translations, collections of short stories, and juvenilia. We wished to avoid prejudicing outcomes relating to questions of genre, and rather to make available sufficiently varied texts for others to investigate them from a pan-European perspective on the basis of our corpora. Nevertheless, our minimal criteria for text selection proved to be problematic for many European traditions. For example, some language collections have very few novels from the period 1840-59. In extreme cases, such as Lithuanian or Latvian, linguistic policies did not permit such publications until the turn of the century; but for others such as Serbian or Hungarian, although the language was flourishing, vernacular publications are very hard to find. The situation as regards text length appears very similar: whereas the French, German, and English traditions all provide abundant examples of texts in each of our three size categories, other writing or publishing traditions tend to vary much less in length. Our aim to include comparable numbers of female and male authors was particularly challenging: while most collections exceeded 10% of female authors, for some (such as Croatian or Serbian) even this threshold could not be met. For many language collections therefore, our predefined criteria posed a significant challenge. It remains a subject for future research to determine whether the disparities we identified are a reflection of the actual available population or a consequence of collection bias.

The basic concepts of “language” and “country” are problematic in the historical European context. Each of our corpora should consist of novels in a single language, as originally published in a single country; hence the absence of translated works, or of works published outside the author’s country. However, the association between languages, countries, and literary traditions is far from being unequivocal or simple. Many European countries (Spain, Switzerland, Belgium, for example) have two or more official languages, each with its own publishing tradition. In many countries, particularly during the 19th century, the language of culture and consequently of publication was not the national language but rather German, French, or English. The same language or a dialectal variant of it may be found in several different countries (Swiss German, or Belgian French, for example). Different languages may be considered the cultural norm in different regions of a single country (Euskera or Catalan in Spain, for example), and will thus have a claim to be represented by their own language corpus.

We tried to adopt a pragmatic approach to these intractable and politically contentious issues. Works by non-European authors were excluded. Works published originally in some language other than that of their place of publication were also excluded: for example, Czech authors originally published in German (such as Kafka) are excluded from the Czech corpus. However, where subject specialists decided that it was appropriate, we allowed for the creation of additional language corpora such as ELTeC-gsw for Swiss German, or ELTeC-eus for the Basque language.

Despite all these challenges, project participants succeeded in creating a multi-lingual novel collection of comparable corpora respecting a predefined corpus design. This necessitated extensive collaborative discussions and an iterative workflow in which the overall balance of each collection was assessed while it was being constructed, with some texts being moved to an “extension” folder when they were seen to disturb the balance of the whole, as well as targeted searching for categories of text not hitherto represented. We developed a metric, the so-called “ELTeC Corpus Composition Criteria Compliance Calculation” or E5C, to assess the extent to which a given collection respected the design and other constraints (see further https://github.com/distantreading/WG1/wiki/E5C-discussion-paper). As of this writing, there are 17 distinct ELTeC corpora, of which all but five have achieved an E5C score in excess of 75%. We believe that our approach to corpus design has made it possible to make meaningful cross-linguistic comparisons about the development of the novel, that quintessentially European literary form.

Keywords
corpus design; multilinguality; corpus metadata

References

  • Lou Burnard, Borja Navarro-Colorado, Carolin Odebrecht (2018): “Sampling Proposals for the ELTeC” (WG1 draft report) https://distantreading.github.io/sampling_proposal.html
  • Lou Burnard, Christof Schöch, Carolin Odebrecht (2021): “In Search of Comity: TEI for Distant Reading”, in: Journal of the Text Encoding Initiative 14. DOI: https://doi.org/10.4000/jtei.3500.
  • Christof Schöch, Roxana Patraș, Diana Santos, Tomaž Erjavec (2021): “Creating the European Literary Text Collection (ELTeC): Challenges and Perspectives”, in: Modern Languages Open 1/25. DOI: http://doi.org/10.3828/mlo.v0i0.364.

Mapping the Inner Life of Characters in the European Novel between 1840 and 1920

Authors

  • Tamara Radak (1)
  • Lou Burnard (2)
  • Pieter Francois (3)
  • Fotis Jannidis (4)
  • Diana Santos (5)

Institutions

  1. University of Vienna
  2. Independent Scholar
  3. University of Oxford & Alan Turing Institute
  4. Julius-Maximilians-Universität Würzburg
  5. Linguateca & University of Oslo

Keywords

  • computational literary studies, ELTeC, European novel, literary history, periodisation

Abstract

It has become something of a critical staple to say that what Bradbury and McFarlane term “the inner life, the soul, l’âme, die Seele, sjœlen” (196) of characters became a central preoccupation of the European novel in the mid- to late-nineteenth and early twentieth centuries, a period which roughly corresponds to traditional accounts of literary modernism. This observation seems like a test case cut out for using computational methods, because we can count references to mental states in novels. If this commonly made assumption is true, we should see an increase of these kinds of words; however, two complications arise in this context: First, starting in France, some realistic forms of writing followed the idea of impersonnalité, impartialité, impassibilité, which is linked to the avoidance of an overt narrator, while there is a general shift from showing to telling during this period (Underwood 2019). Secondly, especially towards the end of this period, writers sought new ways of capturing the ever-elusive complexity of human character, developing innovative techniques that challenged traditional(ist) and realist forms of literature. To model these complex processes in the European novel is beyond the scope of a single paper, but we want to take a first step in this direction.

This paper proposes to map the occurrence of verbs relating to inner life in a selection of novels from the ELTeC collection using two different methodologies from computational literary studies. In psychology, these verbs are linked to different aspects of identification with characters in narrative texts representing perceptions, cognitions, evaluations, emotions and more (van Krieken 2017). Our first problem, then, is how to create a historically adequate list of these verbs for the different languages in the ELTeC collection.

We used level-2 collections from ELTeC for our analysis: English, French, German, Hungarian, Norwegian, Portuguese, Serbian, and Slovenian. In these collections, each word is automatically annotated both with a part of speech and with a root form or lemma. In what follows, by “verb” we mean the lemmatised form of any word annotated as a verb. We employed two methods to obtain verbs representing “inner life”. One method used six arbitrarily selected common seed words (in English: feel, think, believe, know, hope, wish), augmented by 15 others based on their nearest neighbours in language-specific word embeddings. Human annotators then identified in this list the verbs which refer to mental states.
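The first method can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the toy three-dimensional vectors and the words other than the seeds are invented, and cosine similarity is assumed as the nearest-neighbour measure. With six seeds and 15 neighbours each, a procedure of this kind would give at most 6 × 16 = 96 candidates for the annotators to review.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_neighbours(seed, embeddings, k):
    """The k words whose vectors are most similar to the seed's vector."""
    target = embeddings[seed]
    scored = [
        (cosine(target, vector), word)
        for word, vector in embeddings.items()
        if word != seed
    ]
    scored.sort(reverse=True)
    return [word for _, word in scored[:k]]


def candidate_list(seeds, embeddings, k=15):
    """Seeds plus their k nearest neighbours, in order, without duplicates."""
    candidates = []
    for seed in seeds:
        for word in [seed] + nearest_neighbours(seed, embeddings, k):
            if word not in candidates:
                candidates.append(word)
    return candidates


# Toy embeddings, invented for this example; real ones are language-specific
# and have hundreds of dimensions.
toy = {
    "feel": [1.0, 0.1, 0.0],
    "sense": [0.9, 0.2, 0.0],
    "walk": [0.0, 1.0, 0.0],
    "think": [0.1, 0.0, 1.0],
    "ponder": [0.2, 0.0, 0.9],
}

print(candidate_list(["feel", "think"], toy, k=1))
```

The output of such a step is only a candidate list; in the procedure described above, human annotators still decide which candidates refer to mental states.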

The other method was based on the sorted frequency counts of verbs of the ELTeC collection. Human annotators selected the ten most frequent verbs related to inner life in each collection. Both approaches resulted in a list of verb forms, one containing 96 items and the other containing ten. We then counted the occurrences of each of these verbs in each work, normalizing the count by the number of all verbs in that text. Using the publication dates or the publication decades, we can then plot their relative frequencies over time. In a final step, we will try to evaluate these methods and study how consistent the results are.
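The counting and normalisation step can be sketched as below. The function names and the input format (a publication year paired with the list of verb lemmas of one novel) are assumptions for illustration. Averaging the per-text values within a decade is one possible way to aggregate and is not stated in the abstract.

```python
from collections import defaultdict


def relative_frequency(verb_lemmas, inner_life_verbs):
    """Share of the verb tokens of one text whose lemma is on the list."""
    if not verb_lemmas:
        return 0.0
    hits = sum(1 for lemma in verb_lemmas if lemma in inner_life_verbs)
    return hits / len(verb_lemmas)


def mean_by_decade(texts, inner_life_verbs):
    """Average relative frequency per publication decade.

    texts: iterable of (publication_year, list_of_verb_lemmas) pairs.
    """
    per_decade = defaultdict(list)
    for year, lemmas in texts:
        decade = (year // 10) * 10
        per_decade[decade].append(relative_frequency(lemmas, inner_life_verbs))
    return {
        decade: sum(values) / len(values)
        for decade, values in sorted(per_decade.items())
    }


# Invented mini-corpus: each text is reduced to its verb lemmas.
verbs = {"feel", "think"}
corpus = [
    (1845, ["feel", "go"]),
    (1848, ["go", "go"]),
    (1901, ["think"]),
]

print(mean_by_decade(corpus, verbs))
```

Dividing by the number of verbs rather than by all tokens keeps the measure independent of how verb-heavy a given text or language is overall.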

The purpose of this paper is to showcase the possibilities as well as the limitations of ELTeC for addressing questions relating to distant reading, literary history and periodisation, making explicit and visible the often implicit biases inherent in this process and the choices made. Our focus lies on potential methodologies for operationalizing these and similar questions of periodisation, with a view towards their potential application to other corpora in the future. At this conference, we propose to present the two methodologies and give an insight into our preliminary findings. We will also discuss some of the difficulties which arise when a project tries to answer questions based on multilingual corpora, in particular how difficult it is to operationalize abstract concepts like inner life in different languages in a comparable manner.

References

  • Bradbury, Malcolm, and James Walter McFarlane. Modernism: A Guide to Literature 1890-1930. Penguin Books, 1991, 1976.
  • ELTeC collections. https://github.com/COST-ELTeC.
  • van Krieken, Kobie, et al. “Evoking and Measuring Identification with Narrative Characters – a Linguistic Cues Framework.” Frontiers in Psychology, vol. 8, 2017, p. 1190. doi:10.3389/fpsyg.2017.01190.
  • Underwood, Ted. Distant Horizons: Digital Evidence and Literary Change. University of Chicago Press, 2019.

15:30-16:00: Break

16:00-17:30: Session 2a, “Analysing ELTeC: Named Entities”

(1) A fine-grained recognition of Named Entities in ELTeC collection using cascades
Cvetana Krstev, Denis Maurel, Ranka Stanković

(2) Distant Reading of ELTeC text collection through Named Entities
Ranka Stanković, Diana Santos, Carmen Brando, Gábor Palkó, Joanna Byszuk

(3) HuWikifier as a distant reading device?
Gábor Palkó, Tamás Kiss

Session chair: Joanna Byszuk

A fine-grained recognition of Named Entities in ELTeC collection using cascades

Authors

  • Cvetana Krstev (1)
  • Denis Maurel (2)
  • Ranka Stanković (3)

Institutions

  1. University of Belgrade, Faculty of Philology, Serbia
  2. Université de Tours, Lifat, Computer Science Research Laboratory
  3. University of Belgrade, Faculty of Mining and Geology, Serbia

Abstract

In the scope of the COST Action “Distant Reading for European Literary History” (Schöch et al. 2021; Patras et al. 2021), Working Group 2 (WG2), responsible for methods and tools, suggested a set of seven named entity (NE) categories to be used for annotating novels (the so-called “level-2” text version). The tags to be used for this set are: PERS, LOC, ORG, WORK, EVENT, ROLE, DEMO (Frontini et al. 2020; Šandrih Todorović et al. 2021). The level-2 version of Serbian novels was produced using this set of categories and tags (Krstev et al. 2019).

For Serbian and French, fine-grained named entity recognition systems were developed based on exhaustive lexicons of the corresponding languages and rules implemented in the form of cascades of finite-state automata (Maurel and Friburger 2014; Krstev et al. 2014). These systems were developed using the open-source corpus processing suite Unitex/GramLab and its module CasSys. Both systems recognize and tag a rich set of NE categories and subcategories and allow entity embedding; moreover, the French system recognizes NEs that correspond to the TEI Guidelines, chapter 13 (TEI P5). An example that illustrates this in French is (Marquis de la Lande factories):

<org>
<orgName>usines
<persName>
<nameLink>de la</nameLink>
<surname>Lande</surname>
</persName>
</orgName>
</org>

Similarly, in Serbian (Queen Elizabeth of Hungary):

<pers.spec>
<role>kraljice
<top.dr>Ugarske</top.dr>
</role>
<persName.first>Elizabete</persName.first>
</pers.spec>

Moreover, besides the broad categories suggested by WG2, both systems recognize other categories, such as temporal or measurement expressions.

In both the Serbian and French systems, the recognition module is separated from the annotation module, which makes it possible to produce output as needed. In this paper we will illustrate this on a few Serbian and French novels from the ELTeC corpus, chosen to match with respect to the corpus balance criteria, namely the author’s gender, the novel’s size, and the year of first publication. The novels will be annotated both with the simplified tags needed for the level-2 text format and with more elaborate TEI-compliant tags that reflect all nuances of the recognized NEs.
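The separation of recognition from annotation can be pictured with a small sketch. It does not reproduce Unitex/CasSys; the `Entity` class, the `render` function, the mapping of fine-grained tags to the simplified set, and the element-style serialisation of the simplified tags are all assumptions made for illustration. The idea is that one recognition result, stored as a tree of entities, can be written out at either level of detail.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union


@dataclass
class Entity:
    """One recognized entity; children may be plain text or nested entities."""
    fine_tag: str                      # detailed tag, e.g. "pers.spec"
    coarse_tag: Optional[str]          # simplified tag, or None if no counterpart
    children: List[Union[str, "Entity"]] = field(default_factory=list)


def render(node, fine=True):
    """Serialise a recognition result with fine-grained or simplified tags."""
    if isinstance(node, str):
        return node
    inner = " ".join(render(child, fine) for child in node.children)
    tag = node.fine_tag if fine else node.coarse_tag
    if tag is None:
        # No counterpart at this level of detail: keep the text, drop the tag.
        return inner
    return f"<{tag}>{inner}</{tag}>"


# The Serbian example from above; the coarse tags are assumed mappings.
queen = Entity("pers.spec", "PERS", [
    Entity("role", "ROLE", ["kraljice", Entity("top.dr", "LOC", ["Ugarske"])]),
    Entity("persName.first", None, ["Elizabete"]),
])

print(render(queen, fine=True))
print(render(queen, fine=False))
```

Whether nested entities are kept or flattened in the simplified output is a design decision that this sketch leaves open; it simply keeps the nesting.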

The two output formats for the Serbian and French novels will be uploaded into the TXM corpus processing system, which will enable both quantitative and qualitative analysis (Krstev et al., 2019). Besides statistical analysis of the annotated NEs, we will perform a contrastive analysis of Serbian and French NEs and, for both languages, of the fine-grained and simplified versions of annotation. The qualitative analysis will reveal interesting examples of annotation, open issues and hard cases. Textometric analysis in TXM will be illustrated for both fine-grained and simplified versions of the annotated samples.

Finally, we will go back to the research questions that were posed by the Action’s Working Group 3 (literary theory and history) when the Action started. The initial idea and wish of WG3 was to produce fine-grained annotations that would allow, for instance, distinctions between cities and villages, between different roles of persons (professions, family relations, etc.), between persons’ genders, and between types of locations (continent, country, region, city, village, mountain, waterbody, astronym). After an analysis of the availability of NER tools, the fine-grained approach was replaced by a much simpler schema. With this research we would like to reopen these questions and establish whether it is possible to meet the need for more detailed literary analysis based on Named Entities.

References

  • Schöch, Christof, Roxana Patras, Tomaž Erjavec, and Diana Santos, “Creating the European Literary Text Collection (ELTeC): Challenges and Perspectives.” In Modern Languages Open, 1 (2021): 25.
  • Patras, Roxana, Carolin Odebrecht, Ioana Galleron, Rosario Arias, Berenike J. Herrmann, Cvetana Krstev, and others, “Thresholds to the ‘Great Unread’: Titling Practices in Eleven ELTeC Collections.” In Interférences Littéraires/Literaire Interferenties, 25 (2021): 163–87.
  • Krstev, Cvetana, Jelena Jaćimović, Branislava Šandrih, and Ranka Stanković, “Analysis of the First Serbian Literature Corpus of the Late 19th and Early 20th Century with the TXM Platform.” In DH_Budapest_2019, ed. by Gábor Palkó, (2019): 36–37.
  • Frontini, Francesca, Carmen Brando, Joanna Byszuk, Ioana Galleron, Diana Santos, and Ranka Stanković. “Named Entity Recognition for Distant Reading.” In CLARIN Annual Conference 2020, Proceedings, 05-07 October 2020, (2020): 27–41.
  • Maurel, Denis, Nathalie Friburger, Iris Eshkol-Taravella. “Enrichment of Renaissance Texts with Proper Names.” Infotheca – Journal for Digital Humanities Volume XV, No. 1, (2014): 29a-41a.
  • Krstev, Cvetana, Ivan Obradović, Miloš Utvić, and Duško Vitas. “A system for named entity recognition based on local grammars.” Journal of Logic and Computation 24, no. 2, (2014): 473-489.
  • Šandrih Todorović, Branislava, Cvetana Krstev, Ranka Stanković, and Milica Ikonić Nešić. “Serbian NER&Beyond: The Archaic and the Modern Intertwinned.” In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021), (2021): 1252-1260.
  • TEI P5: Guidelines for Electronic Text Encoding and Interchange, 13 Names, Dates, People, and Places, https://tei-c.org/release/doc/tei-p5-doc/en/html/ND.html.

Distant Reading of ELTeC text collection through Named Entities

Authors

  • Ranka Stanković (1)
  • Diana Santos (2)
  • Carmen Brando (3)
  • Gábor Palkó (4)
  • Joanna Byszuk (5)

Institutions

  1. University of Belgrade, Faculty of Mining and Geology, Serbia
  2. Linguateca & University of Oslo, Norway
  3. School for Advanced Studies in the Social Sciences, France
  4. Department of Digital Humanities – Eötvös Loránd University, Hungary
  5. Institute of Polish Language, Polish Academy of Sciences, Poland

Keywords

  • named entities, distant reading, literary corpus, NER

Abstract

In this paper we present the work carried out in Working Group 2 “Methods and Tools” (WG2) to automatically annotate the novels from the ELTeC collection with Named Entities (NE) for Portuguese, Slovenian, Hungarian, French and Serbian. A summary of the results, interesting cases and a comparison of NEs in the different language sub-collections, as well as preliminary contrastive explorations, will also be presented. Special attention will be given to the open issues and challenges for NE annotation in old novels and in TEI.

The presented results follow up on the NE case study which established common annotation guidelines for 7 categories (DEMO, EVENT, ORG, PERS, PLACE, ROLE and WORK) and tested a selection of NER tools to assess their capacity to reproduce such annotation automatically (Stankovic et al. 2019). Although the initial idea was to use a common framework for all languages, the (albeit modest) evaluation experiments showed that the best results are achieved with tools trained for each language (Frontini et al. 2020). Each language team therefore chose the most appropriate tools for encoding NEs in ELTeC level2.

We will briefly present the tools used for each language and the current state of the annotation.

For the Slovenian sub-collection (https://github.com/COST-ELTeC/ELTeC-slv), NEs were annotated with Janes-NER (https://github.com/clarinsi/janes-ner) (Fišer et al. 2020) using five categories: PER, LOC, ORG, deriv-per and MISC.

NE recognition for Serbian was performed with the rule-based system SrpNER (Krstev 2014), which was established as the best option in a detailed analysis presented in (Šandrih Todorović et al. 2021). The NER annotation is part of the pipeline for the level2 edition described in (Stanković et al. 2021).

Portuguese NER was done with PALAVRAS-NER (Bick 2006), a NE recognizer included in a full-fledged parser for Portuguese, PALAVRAS (Bick 2014), which has an option for XML output. Further adaptation to the level2 format was done with a Perl pipeline available in the Portuguese ELTeC GitHub. All categories were considered, although DEMO and ROLE, not being “standard” NEs, were computed from semantic information produced by the parser.

NE recognition for Hungarian was performed by the emBERT model designed by Dávid Nemeskey (https://github.com/DavidNemeskey/emBERT), which outperforms the existing rule-based tools, recognizing person names, organizations and geographical entities. The tool is part of the emtsv NLP chain. The service is maintained by the Department of Digital Humanities (http://emtsv.elte-dh.hu:5000). As this service does not offer any functionality for XML, we developed a pipeline in Python to add level2 PoS and NER annotations to the TEI XML representation of the Hungarian ELTeC novels.

French ELTeC level2 with NER is being built on developments made in the CLARIN ParlaMint project (Erjavec et al. 2022): TEI-compatible tools and linguistic resources that have been successfully used to create linguistic annotations (PoS tagging, lemmatization) and NEs on textual data of parliamentary debates for French (Diwersy and Luxardo 2020). Stanza models and ad-hoc adaptations handle tokenization-related issues and take into account the specific NE annotation guidelines agreed for ELTeC.
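Adding NER output to an already tokenized TEI file, as several of the teams describe, could look like the following sketch. It is not any team's actual pipeline. It assumes that each sentence is an element whose children are `<w>` tokens, that the tagger returns one B-/I-/O label per token, and that entities are marked by wrapping tokens in `<rs type="...">`; the TEI namespace is left out for brevity.

```python
import xml.etree.ElementTree as ET


def add_entities(sentence, labels):
    """Wrap runs of <w> tokens labelled B-X, I-X, ... in <rs type="X">."""
    tokens = list(sentence)
    if len(tokens) != len(labels):
        raise ValueError("one label per token is required")
    # Detach all tokens, then re-attach them, grouped where needed.
    for token in tokens:
        sentence.remove(token)
    current = None  # the <rs> element currently being filled, if any
    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            current = ET.SubElement(sentence, "rs", {"type": label[2:]})
            current.append(token)
        elif (label.startswith("I-") and current is not None
              and current.get("type") == label[2:]):
            current.append(token)
        else:
            # "O" or an inconsistent I- label: leave the token unwrapped.
            current = None
            sentence.append(token)
    return sentence


# Invented example sentence with PoS and lemma already present on each token.
s = ET.fromstring(
    '<s><w pos="PROPN" lemma="Napoleon">Napoleon</w>'
    '<w pos="VERB" lemma="arrive">arrived</w></s>'
)
add_entities(s, ["B-PERS", "O"])
print(ET.tostring(s, encoding="unicode"))
```

Working on the token elements rather than on plain text keeps the existing PoS and lemma attributes aligned with the added entity markup.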

We can already describe some exploratory studies using NER in the ELTeC collection(s): comparing the amount and type of NEs across periods, gender and canonicity; correlation with proper noun marking; and measuring the importance of some NEs across collections (Napoleon, Christmas and Europe). The main problems for many languages were that many of the NER tools do not work with TEI formats and are applied to plain text, that alignment with PoS-tagging and lemmatization was an issue, that NER tools were not trained for literature of that period, and that people have different opinions on what NER should be.

References

  • Bick, E. (2014) PALAVRAS, a Constraint Grammar-based Parsing System for Portuguese. In: Tony Berber Sardinha & Thelma de Lurdes São Bento Ferreira (eds.): Working with Portuguese Corpora, pp 279-302. London/New York: Bloomsbury Academic.
  • Bick, E. (2006). Functional Aspects in Portuguese NER. In Renata Vieira, Paulo Quaresma, M. da Graça Volpe Nunes, Nuno J. Mamede, Cláudia Oliveira & Carmelita M. Dias, editors, Computational Processing of the Portuguese Language: 7th International Workshop, PROPOR 2006. Itatiaia, Brazil, May 2006 (PROPOR’2006), pp 80-89. Springer Verlag.
  • Erjavec, T., Ogrodniczuk, M., Osenova, P. et al. (2022) The ParlaMint corpora of parliamentary proceedings. Language Resources & Evaluation. https://doi.org/10.1007/s10579-021-09574-0
  • Diwersy, S., Luxardo, G. (2020) Querying a large annotated corpus of parliamentary debates. LREC, ParlaCLARIN Workshop, 2020, Marseille, France.
  • Fišer, D., Ljubešić, N. and Erjavec, T. (2020) “The Janes project: language resources and tools for Slovene user generated content.” Language resources and evaluation 54, no. 1: 223-246.
  • Frontini, F., Brando, C., Byszuk, J., Galleron, I., Santos, D. and Stanković, R. (2020) “Named entity recognition for distant reading in ELTeC.” In CLARIN Annual Conference 2020. https://hal.archives-ouvertes.fr/hal-03160438/file/Named%20Entity%20Recognition%20for%20Distant%20Reading%20in%20ELTeC-Clarin2020.pdf
  • Stankovic, R., Frontini, F., Erjavec, T., and Brando, C. (2019) ‘Named Entity Recognition for Distant Reading in Several European Literatures’, in DH_Budapest_2019, ed. by Gábor Palkó (presented at the DH_Budapest_2019, Budapest: ELTE, 2019) https://comum.rcaap.pt/bitstream/10400.26/31832/1/Stankovicetal2019.pdf
  • Stanković, R., Krstev, C., Šandrih Todorović, B., Škorić, M. (2021) Annotation of the Serbian ELTeC Collection, Infotheca – Journal for Digital Humanities 21, no. 2: 43-59.
  • Šandrih Todorović, B., Krstev, C., Stanković, R., Ikonić Nešić, M. (2021) “Serbian NER & Beyond: The Archaic and the Modern Intertwinned”, in Proceedings of the International Conference Recent Advances in Natural Language Processing – RANLP 2021, 1-3 September 2021 (virtual), eds. Galia Angelova, Maria Kunilovskaya, Ruslan Mitkov, Ivelina Nikolova-Koleva, pp. 1252-1260. http://dr.rgf.bg.ac.rs/files/original/335b570db10de162d44f07c09f68451c3aa56b74.pdf

HuWikifier as a distant reading device?

Authors

  • Gábor Palkó
  • Tamás Kiss

Institutions

  1. Eötvös Loránd University
  2. Monguz Ltd.

Keywords

  • named entities, distant reading, literary corpus, wikification, named entity linking, geotagging

Abstract

There is a long tradition of using named entity recognition technology in the distant reading of literary texts. However, far fewer attempts have been made to apply named entity linking methods to the interpretation of literary corpora. In the case of Hungarian literature, both approaches face serious difficulties, as the widely used multilingual tools tend to perform poorly in processing Hungarian-language corpora. With the advances in vector space technology, more efficient models than rule-based NER tools have emerged (https://github.com/DavidNemeskey/emBERT), but they have not yet been applied to the analysis of literary corpora. There are no examples of NEL technology being applied in a literary context, and no publicly available NEL tool optimized for Hungarian has been developed to date. Although some Hungarian digital philology projects have used semantic data enrichment, i.e. mapping metadata or text segments to namespaces, they have largely followed a manual or semi-automatic approach. No wonder: there were not enough sophisticated tools for automatic tagging.

A Hungarian wikification tool was developed in the framework of the National Digital Heritage Laboratory. This was necessary because the known multilingual wikification tools perform poorly when analyzing Hungarian texts. Initial testing of the Hungarian wikifier (HuWikifier) showed that morphological analysis of texts is necessary for successful entity recognition, so the emtsv automatic language parser (https://github.com/nytud/emtsv) was integrated into the tool. We then re-tested the performance of HuWikifier using human annotators. Experience showed that most difficulties were encountered in finding and linking relevant common nouns. This could be solved by integrating a keyword search algorithm.

In this presentation, we will attempt to test the performance of HuWikifier on the text of Hungarian novels. The geographic entities of a corpus of 400 novels from the 19th and 20th centuries (https://github.com/ELTE-DH/regenykorpusz/) are identified using the tool. The results are compared with the performance of a multilingual wikification tool, and map visualizations are used to test whether the tool, at its current level of performance, is able to show patterns of literary relevance.

References

  • Frontini, F., Brando, C., Byszuk, J., Galleron, I., Santos, D. and Stanković, R. (2020) “Named entity recognition for distant reading in ELTeC.” In CLARIN Annual Conference 2020. https://hal.archives-ouvertes.fr/hal-03160438
  • Stankovic, R., Frontini, F., Erjavec, T., and Brando, C. (2019) ‘Named Entity Recognition for Distant Reading in Several European Literatures’, in DH_Budapest_2019, ed. by Gábor Palkó (presented at the DH_Budapest_2019, Budapest: ELTE, 2019) https://comum.rcaap.pt/bitstream/10400.26/31832/1/Stankovicetal2019.pdf
  • Dekker N, Kuhn T, van Erp M. (2019) “Evaluating named entity recognition tools for extracting social networks from novels.” In PeerJ Computer Science 5:e189 https://doi.org/10.7717/peerj-cs.189
  • de Does, J., Depuydt, K., Van Dalen-Oskam, K., & Marx, M. (2017). Namescape: named entity recognition from a literary perspective. CLARIN in the Low Countries, 361-370.
  • Soudani, A., Meherzi, Y., Bouhafs, A., Frontini, F., Brando, C., Dupont, Y., & Mélanie-Becquet, F. (2019, July). Adapting a system for Named Entity Recognition and Linking for 19th century French Novels. In Digital Humanities 2019.

16:00-17:30: Session 2b, “Also Analysing ELTeC: space and time”

(1) Emotions and space: an investigation of “urban” vs. “rural” emotional language in Swiss-German fiction around 1900
Giulia Grisot, Berenike Herrmann

(2) The Chronological Analysis of Textual Data. A statistical perspective
Fabio Ciotti, Stefano Ondelli, Andrea Sciandra, Floriana Sciumbata, Matilde Trevisani, Luca Tringali, Arjuna Tuzzi

(3) Sentence length across ELTeC collections and Gutenberg Fiction
Christof Schöch

Session chair: Rosario Arias

Emotions and space: an investigation of “urban” vs. “rural” emotional language in Swiss-German fiction around 1900

Authors

  • Giulia Grisot (1)
  • Berenike Herrmann (1)

Institutions

  1. Universität Bielefeld

Keywords

  • computational literary studies, spatial analysis, sentiment analysis, Swiss literature

Abstract

The present paper reports on research conducted in a project closely affiliated with the COST Action ‘Distant Reading for European Literary History’ at the levels of corpus building, methodology development and literary history. Here, we will focus on the last of these, presenting an analysis of the representation and affective encoding of fictional landscape in a corpus of Swiss literary works of the late 19th and early 20th centuries written in German.

In order to do this, a comprehensive dictionary of spatial entities was compiled, inclusive of terms commonly used in the historical context of the corpus (e.g. Weiher, Weg, Hütte, Berg, See, Straße, Gebäude / pond, path, cottage, mountain, lake, road, building). We collected terms in the broad spatial categories RURAL – consisting of the entity categories ‘natural,’ ‘rural,’ ‘geo-natural’ and ‘geo-rural’ – and URBAN, consisting instead of ‘urban’ and ‘geo-urban’ entities. In line with our project’s research focus, we incorporated as many Swiss-specific terms as possible (e.g. Wiler, Bergli), using existing thesauri (Openthesaurus, Swiss Idiotikon) and corrected, disambiguated, and implemented the lists manually. For geo-urban and geo-rural entities (e.g. Basel, Zürich, Berlin, Rom, Uitikon) as well as for geo-natural entities (e.g. Matterhorn, Rigi, Rhein), we extracted terms from Wikidata and Geonames.org, harvesting geolocations of Switzerland as well as of its neighbouring countries: Austria, France, Germany and Italy.

In terms of spatial entities, our results showed that references to non-fictional Swiss geolocations in our corpus were more frequent than references to those of neighbouring countries, and particularly prominent for ‘valleys’ and ‘cities’, suggesting that texts at the turn of the century were indeed influenced by ideas about national identity. When looking at the discrete emotions (Russell 1980, Ekman 1982) encoded in proximity of spatial entities, we found an overall higher ‘emotional richness’ for passages containing RURAL entities in comparison to passages containing URBAN ones – both for emotions that are considered ‘positive’ (e.g. ‘joy’) and ‘negative’ (e.g. ‘fear’). The analysis of valence/polarity showed instead a less clear picture, with different lexicons pointing in different directions and no univocal answer.
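The dictionary-based approach described above can be illustrated with a minimal sketch. It is not the project's actual pipeline: the spatial terms are taken from the examples in this abstract, while the emotion lexicon, the window size and the function names are assumptions made purely for illustration.

```python
import re
from collections import Counter

# Toy lexicons. The spatial terms come from the abstract's examples;
# the emotion words and their labels are hypothetical.
SPATIAL = {
    "RURAL": {"weiher", "weg", "hütte", "berg", "see", "matterhorn", "rigi"},
    "URBAN": {"straße", "gebäude", "basel", "zürich", "berlin"},
}
EMOTION = {"freude": "joy", "glück": "joy", "angst": "fear", "furcht": "fear"}


def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def emotions_near_space(text, window=5):
    """Count emotion words within `window` tokens of each spatial term."""
    tokens = tokenize(text)
    counts = {category: Counter() for category in SPATIAL}
    for i, token in enumerate(tokens):
        for category, terms in SPATIAL.items():
            if token in terms:
                start, end = max(0, i - window), i + window + 1
                for neighbour in tokens[start:end]:
                    if neighbour in EMOTION:
                        counts[category][EMOTION[neighbour]] += 1
    return counts


sample = ("Am See fühlte sie Freude und zugleich Angst vor dem Berg. "
          "In der Straße von Zürich war nichts.")
result = emotions_near_space(sample)
print(result)
```

A plain lookup like this cannot disambiguate homographs (German "See" is both "lake" and "sea"), which is one reason manual correction of the lists, as mentioned above, matters.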

While, on the one hand, these results do not necessarily confirm our hypothesis that rural and natural landscape would be depicted as more ‘positive’, the ambiguity in the direction of valence can be interpreted in different ways, with a plausible explanation being the perception of natural landscape as ‘sublime’ – thus including fearful/negative as well as admiring/positive emotions. With its data-driven perspective on existing theories of rural and urban fictional space, our study offers a first-time account of the affective encoding of fictional space.

References

  • Derungs, Curdin, and Ross S Purves. 2014. “From text to landscape: locating, identifying and mapping the use of landscape features in a Swiss Alpine corpus.” International Journal of Geographical Information Science 28 (6). Taylor & Francis: 1272–93. doi:10.1080/13658816.2013.772184.
  • Ekman, Paul, W V Friesen, and P Ellsworth. 1982. “What emotion categories or dimensions can observers judge from facial behavior?” In Emotion in the Human Face, 98–110. Cambridge.
  • Heuser, Ryan, Mark Algee-Hewitt, and Annalise Lockhart. 2016. “Mapping the emotions of London in fiction, 1700–1900: A crowdsourcing experiment.” In Literary Mapping in the Digital Age, 43–64. Routledge.
  • Jacobs, A. M. (2019). Sentiment analysis for words and fiction characters from the perspective of computational (neuro-)poetics. Frontiers in Robotics and AI, 6, 53. https://doi.org/10.3389/frobt.2019.00053.
  • Kanske, Philipp, and Sonja A Kotz. 2010. “Leipzig Affective Norms for German: A reliability study.” Behavior Research Methods 42 (4): 987–91. doi:10.3758/BRM.42.4.987.
  • Klinger, Roman, Surayya Samat Suliya, and Nils Reiter. 2016. “Automatic Emotion Detection for Quantitative Literary Studies.”
  • Kaufmann, E., & Zimmer, O. (1998). In search of the authentic nation: landscape and national identity in Canada and Switzerland. Nations and Nationalism, 4(4), 483–510.
  • Plutchik, Robert. 1980. “A general psychoevolutionary theory of emotion.” In Theories of Emotion, 3–33. Elsevier.
  • Projekt Gutenberg DE. 2021. “Projekt Gutenberg-DE.” https://www.projekt-gutenberg.org/.
  • R Core Team. 2021. “R: A language and environment for statistical computing.” Vienna, Austria. https://www.r-project.org/.
  • Russell, James A. 1980. “A circumplex model of affect.” Journal of Personality and Social Psychology 39 (6). American Psychological Association: 1161.
  • Stamm, Nadine. 2014. “Klassifikation und Analyse von Emotionswörtern in Tweets für die Sentimentanalyse.” Bachelorarbeit, Universität Zürich.
  • Võ, M. L. H., Jacobs, A. M., & Conrad, M. (2006). Cross-validating the Berlin affective word list. Behavior Research Methods, 38(4), 606–609.
  • Wiki, Wikimedia Foundation Governance. 2021. “Wikimedia Foundation Governance Wiki.” https://foundation.wikimedia.org/w/index.php?title=Home&oldid=130586.

The Chronological Analysis of Textual Data. A statistical perspective

Authors

  • Fabio Ciotti (1)
  • Stefano Ondelli (2)
  • Andrea Sciandra (3)
  • Floriana Sciumbata (2)
  • Matilde Trevisani (2)
  • Luca Tringali (4)
  • Arjuna Tuzzi (5)

Institutions

  1. Università di Roma Tor Vergata
  2. Università di Trieste
  3. Università di Modena e Reggio Emilia
  4. Independent researcher
  5. Università di Padova

Abstract

This study illustrates some examples of the statistical analysis of a large diachronic corpus of Italian literary prose. The aim is to identify prototypical trends in word frequencies and word clusters sharing similar temporal patterns.

The search for regularities within the time-series analysis of discrete data has been widely dealt with in statistics. In particular, when it comes to the study of the chronological patterns of word tokens in diachronic corpora, reference can be made to the analyses conducted by Tuzzi 2018, Trevisani & Tuzzi 2018, Trevisani & Tuzzi 2015 on research articles, Trevisani & Tuzzi 2013 on institutional speeches, and Sciandra, Trevisani, Tuzzi 2021 on literary texts. To answer some of the questions relating to the periodization of the ELTeC corpus (Odebrecht, Burnard, Schöch 2021), after selecting different sets of words, functional data analysis (Ramsay and Silverman 2005) and Named Entity Recognition approaches were adopted to extract chronological patterns within the corpus.

Materials and research questions

Pending the completion of the Italian version of the ELTeC, a provisional corpus was compiled including 100 literary works dating 1840-1923 and totalling over 7 million tokens (see section 2: “Esperimenti di Distant Reading. Estrazione, analisi e visualizzazione di dati linguistici da corpora letterari”, Rivista internazionale di tecnica della traduzione, 23, 2021). Pursuant to the periodization proposals put forward by the WG3 of the Distant Reading for European Literary History COST Action, explorative data analyses (EDA) were conducted on the following aspects:

  • a) To trace the development of the depiction of the inner life of characters in the course of the 19th Century, two approaches were adopted: (1) through word-embedding methods, reconstructing the semantic constellation pivoting around six seed-verbs: feel, think, believe, know, hope, wish; (2) the most frequent verbs emerging in the corpus were selected and traced among those describing the characters’ and authors’ inner lives.
  • b) To trace changes in society, according to a NER approach, reference was made to word lists referring to job descriptions, religious titles, military ranks and buildings (residential, industrial, commercial, rural vs urban, etc.). The lists were obtained through a combination of manual (dictionaries and thesauruses) and automatic procedures (e.g. automatic scraping of Internet repositories such as Wikidata).

Methods and software

The tools adopted for the statistical analysis of textual data were developed within the R environment (R Core Team, 2021), both through ad-hoc scripts and by complementing pre-existing packages. A lemmatized version of the corpus was developed in R thanks to the udpipe package (Wijffels 2021, Straka & Straková 2017) and the Italian Stanford Dependency Treebank (ISDT). Word embedding procedures (Mikolov et al. 2013) were conducted according to the GloVe method (Pennington, Socher and Manning 2014). The procedure adopted to detect the chronological development of lemma occurrences follows a statistical learning approach in three steps:

  1. normalization of raw occurrences
  2. smoothing
  3. curve clustering.
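The three steps can be sketched as follows. This is a simplified illustration in Python rather than the authors' R workflow: a centred moving average stands in for functional smoothing, plain k-means stands in for the curve clustering, and the lemma names and counts are invented toy data.

```python
def normalize(counts, totals):
    """Step 1: relative frequency per time slice (occurrences / slice size)."""
    return [c / t for c, t in zip(counts, totals)]


def smooth(series, width=3):
    """Step 2: centred moving average (a stand-in for functional smoothing)."""
    half = width // 2
    smoothed = []
    for i in range(len(series)):
        chunk = series[max(0, i - half): i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


def rescale(curve):
    """Divide by the curve's mean so that shapes, not levels, are compared."""
    mean = sum(curve) / len(curve)
    return [v / mean for v in curve] if mean else list(curve)


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(curves, k, iterations=20):
    """Step 3: k-means on curves, with deterministic farthest-first seeding."""
    centroids = [curves[0]]
    while len(centroids) < k:
        centroids.append(
            max(curves, key=lambda c: min(sq_dist(c, m) for m in centroids)))
    labels = [0] * len(curves)
    for _ in range(iterations):
        labels = [min(range(k), key=lambda j: sq_dist(c, centroids[j]))
                  for c in curves]
        for j in range(k):
            members = [c for c, lab in zip(curves, labels) if lab == j]
            if members:
                centroids[j] = [sum(v) / len(members) for v in zip(*members)]
    return labels


# Invented toy data: tokens per time slice and raw counts for four lemmas.
totals = [1000, 2000, 1500, 3000, 2500]
raw = {
    "lemma_a": [1, 4, 6, 18, 20],     # rising
    "lemma_b": [2, 6, 9, 24, 25],     # rising
    "lemma_c": [20, 30, 15, 18, 10],  # falling
    "lemma_d": [10, 16, 9, 9, 5],     # falling
}
curves = [rescale(smooth(normalize(c, totals))) for c in raw.values()]
labels = kmeans(curves, k=2)
print(dict(zip(raw, labels)))
```

Rescaling each curve by its own mean is one possible normalization choice among several; the choice affects which words end up clustered together.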

Calculations were made with the help of libraries and packages of the following software: R, text2vec (Selivanov, Bickel and Wang 2020), fda (Ramsay et al. 2021), clusterCrit (Desgraupes 2018), cclust (Dimitriadou 2021), clusterSim (Walesiak & Dudek 2020), kml (Genolini et al. 2015), complemented by ad-hoc coding.

Results

The statistical approach implemented in this study does not reveal clearcut chronological patterns. Although overall trends do emerge, many words do not exhibit distinct behaviours in the course of time. This paves the way for further discussion on the scope and composition of the corpus and the selection of linguistic features to be observed.

References

  • Desgraupes B. (2018) clusterCrit: Clustering Indices, R package version 1.2.8 https://CRAN.R-project.org/package=clusterCrit.
  • Dimitriadou E. (2021) cclust: Convex Clustering Methods and Clustering Indexes. R package version 0.6-23. https://CRAN.R-project.org/package=cclust.
  • Genolini C., Alacoque X., Sentenac M. & Arnaud C. (2015) “kml and kml3d: R Packages to Cluster Longitudinal Data”, Journal of Statistical Software, 65(4), pp. 1-34. http://www.jstatsoft.org/v65/i04/.
  • Mikolov, T., Chen, K., Corrado, G.S., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. In 1st International Conference on Learning Representations, ICLR 2013, Scottsdale, Arizona, USA, May 2-4, 2013, Workshop Track Proceedings. arXiv:1301.3781.
  • Odebrecht C., Burnard L. & Schöch C. (eds) (2021) European Literary Text Collection (ELTeC), version 1.1.0, April 2021. COST Action Distant Reading for European Literary History (CA16204). DOI: doi.org/10.5281/zenodo.4662444.
  • Pennington J., Socher R. & Manning C.D. (2014) “GloVe: Global Vectors for Word Representation”, in Empirical Methods in Natural Language Processing (EMNLP), pp. 1532-1543 http://www.aclweb.org/anthology/D14-1162.
  • R Core Team (2021) R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/.
  • Ramsay J. & Silverman B.W. (2005) Functional data analysis, Springer series in Statistics, New York, Springer.
  • Ramsay J.O., Graves S. & Hooker G. (2021). fda: Functional Data Analysis, R package version 5.5.1. https://CRAN.R-project.org/package=fda.
  • Selivanov D., Bickel M. & Wang Q. (2020). text2vec: Modern Text Mining Framework for R. R package version 0.6. https://CRAN.R-project.org/package=text2vec.
  • Straka M. & Straková J. (2017) “Tokenizing, POS Tagging, Lemmatizing and Parsing UD 2.0 with UDPipe”, in Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies. Association for Computational Linguistics, Vancouver, Canada, pp. 88-99.
  • Trevisani M., Tuzzi A. (2013) “Shaping the history of words”, in Methods and Applications of Quantitative Linguistics: Selected papers of the VIIIth International Conference on Quantitative Linguistics (QUALICO). Ed. by I. Obradović, E. Kelih & R. Köhler, Belgrade, Akademska Misao, pp. 84-95.
  • Trevisani M., Tuzzi A. (2015) “A portrait of JASA: the History of Statistics through analysis of keyword counts in an early scientific journal”, Quality and Quantity, 49, pp. 1287-1304.
  • Trevisani M., Tuzzi A. (2018) “Learning the evolution of disciplines from scientific literature. A functional clustering approach to normalized keyword count trajectories”, Knowledge-based systems, 146, pp. 129–141.
  • Tuzzi A. (2018) (ed) Tracing the Life Cycle of Ideas in the Humanities and Social Sciences, Cham, Springer.
  • Sciandra A., Trevisani M., Tuzzi A. (2021) Rivista internazionale di tecnica della traduzione/International Journal of Translation, 23, pp. 126-138.
  • Walesiak M. & Dudek A. (2020) “The Choice of Variable Normalization Method in Cluster Analysis”, in Education Excellence and Innovation Management: A 2025 Vision to Sustain Economic Development During Global Challenges. Proceedings of the 35th International Business Information Management Association Conference (IBIMA), 1-2 April 2020 Seville, Spain. Ed. by K.S. Soliman, IBIMA, pp. 325-340.
  • Wijffels J. (2021) udpipe: Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing with the ‘UDPipe’ ‘NLP’ Toolkit. R package version 0.8.8. https://CRAN.R-project.org/package=udpipe.

Sentence length across ELTeC collections and Gutenberg Fiction

Author

  • Christof Schöch

Institution

  • Trier Center for Digital Humanities, University of Trier

Abstract

The topic of this talk is the evolution of average sentence length in literary texts. Fiction in multiple European languages is investigated, with a focus on the period ca. 1840-1920. The question: do sentences in novels get shorter over the course of the nineteenth century?

The key question concerns sentence length in fictional, narrative prose (primarily, but not exclusively novels) in English as well as in several other European languages. The specific claim that this talk sets out to test is that the average sentence length that can be observed in fictional narrative texts decreases over the course of the 19th and the early 20th century. This hypothesis has been suggested to me by a colleague who found such an assumption formulated by German theorists of language and grammar in the nineteenth century, in the context of a more general assumption of a gradual deterioration and simplification of languages since Antiquity. She explained to me that prominent figures such as August Schleicher, Jakob Grimm or Franz Bopp formulated such views, for example in Schleicher’s Die deutsche Sprache, first published in 1860, and that similar assumptions by others have been documented more recently in books like Konrad Koerner’s Practicing Linguistic Historiography or in the volume edited by Winfred Lehmann, A Reader in Nineteenth Century Historical Indo-European Linguistics. So, specifically and only with respect to sentence length, does the available data bear this out?
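One simple way to operationalise this question is sketched below. It is an illustration only, not the procedure used in the talk: the punctuation-based sentence splitter is deliberately naive, the three snippets are invented, and a real study would rely on proper sentence segmentation for each language.

```python
import re


def mean_sentence_length(text):
    """Average number of word tokens per sentence (naive splitter)."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    lengths = [len(re.findall(r"\w+", s)) for s in sentences]
    return sum(lengths) / len(lengths)


def slope(xs, ys):
    """Least-squares slope of ys on xs; a negative value means shortening."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator


# Invented snippets standing in for novels, keyed by publication year.
toy_corpus = {
    1845: ("It was a long and winding road that led, after many turns, "
           "to the old house. She waited."),
    1880: "The road was long. It led to the old house. She waited there.",
    1915: "He came. She left. Nothing more was said.",
}
years = sorted(toy_corpus)
averages = [mean_sentence_length(toy_corpus[y]) for y in years]
trend = slope(years, averages)
print(averages, trend)
```

A mean per text hides the distribution within it; medians or full length distributions may tell a different story, so the slope of the means is only a first check.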

References

  • August Schleicher, Die deutsche Sprache, 1860.
  • Konrad Koerner, Practicing Linguistic Historiography.
  • Winfred Lehmann, A Reader in Nineteenth Century Historical Indo-European Linguistics.

18:00-19:30: Evening keynote

Evening Keynote: Prof. Dr. Karina van Dalen-Oskam (University of Amsterdam and The Huygens Institute for the History of the Netherlands, Netherlands): “Distant Dreaming About European Literary History”

Session chair: Jan Rybicki

Karina van Dalen-Oskam: Distant Dreaming About European Literary History

Abstract

In my talk I will take stock of where we currently stand in Computational Literary Studies and explicitly dream of what we may want to be able to do in the future. What could be the next steps towards more knowledge about the language and function of literature in Europe in the past and the present? What kind of data and tools would we need? Which other research disciplines come into view when we want to answer bigger and bigger questions? And what is the impact our research could have on the multilingual European academic and literary landscape?

Bio

Karina van Dalen-Oskam is research group leader at Huygens Institute (Royal Netherlands Academy of Arts and Sciences) and professor of Computational Literary Studies at the University of Amsterdam. Her research focuses on computational literary studies and the development of methods and techniques for the stylistic analysis of modern Dutch and English novels. She applies these methods to analyze stylistic differences in texts, oeuvres, genres, time periods, and cultures or languages. Proper names in literary texts have her special interest. She is also interested in canon formation. She was project leader of The Riddle of Literary Quality (2012-2019) and currently leads, among other projects, Track Changes: Textual scholarship and the challenge of digital literary writing. At the University of Wolverhampton she is co-investigator in the project Novel Perceptions: Towards an inclusive canon, in which the research done in the Dutch Riddle of Literary Quality project is replicated in the United Kingdom.

Friday, April 22, 2022

9:00-10:00: Session 3, “Analysing ELTeC some more: style and characters”

(1) ELTeC and Delta in eleven languages: relatively good news for stylometrists
Jan Rybicki

(2) Imagined differences: approaches to variation in fictional character voices in literary history
Artjoms Šeļa, Joanna Byszuk, Bartlomiej Kunda, Laura Hernández-Lorenzo, Botond Szemes, Maciej Eder

Session chair: Carolin Odebrecht

ELTeC and Delta in Eleven Languages: Relatively Good News for Stylometrists

Author

  • Jan Rybicki (1)

Institution

  1. Institute of English Studies, Jagiellonian University, Kraków, Poland

Keywords

  • Stylometry; word usage distance measure; Burrows’s Delta; Cosine Delta

Abstract

ELTeC is a multilingual set of text collections that can be used to assess the differences in results obtained for most-frequent-words-based stylometric analyses. While the impact of many other parameters of authorship attribution and stylometric distant reading has been discussed ever since John Burrows (2002) introduced his Delta distance measure (e.g. Argamon 2008, Hoover 2004, Eder and Rybicki 2013, Jockers 2013), inter-language studies were comparatively few due to problems with the availability of the material (Rybicki and Eder 2011, Rybicki 2012). ELTeC is not yet perfect in this regard as its components in different languages do not maintain a stable text-per-author ratio, but its eleven largest sets (100 texts each in Czech, German, English, French, Hungarian, Polish, Portuguese, Romanian, Slovenian and Serbian, and 98 in Spanish) represent a fairly comparable spectrum of 19th-early 20th-century realistic novels.

This study analysed Level 2 (where available) or Level 1 versions of ELTeC text collections with stylo (Eder et al. 2016) for R (R Core Team 2021). Frequencies of 100 and 1000 most frequent words were compared within each language to obtain scores for two different distance measures, Classic Delta and Cosine Delta (Smith and Aldridge 2011). The choice of this approach over the usual assessment of authorship attribution success rates helped minimize the impact of the above-mentioned differences between different ELTeC sets.
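As a rough illustration of the two measures, the sketch below computes both from z-scored word frequencies: Classic Delta as the mean absolute difference of z-scores, Cosine Delta as one minus the cosine similarity of the z-score vectors. The frequency table is invented, and details such as the choice of standard deviation may differ from what stylo does internally.

```python
from math import sqrt
from statistics import mean, pstdev


def z_scores(freq_table):
    """Standardise each word's relative frequency across all texts."""
    columns = list(zip(*freq_table))
    mus = [mean(col) for col in columns]
    # Population SD is an assumption here; fall back to 1.0 for constant columns.
    sds = [pstdev(col) or 1.0 for col in columns]
    return [[(f - m) / s for f, m, s in zip(row, mus, sds)]
            for row in freq_table]


def classic_delta(za, zb):
    """Burrows's Delta: mean absolute difference between z-scores."""
    return mean(abs(a - b) for a, b in zip(za, zb))


def cosine_delta(za, zb):
    """Cosine Delta: 1 minus the cosine similarity of the z-score vectors."""
    dot = sum(a * b for a, b in zip(za, zb))
    norm = sqrt(sum(a * a for a in za)) * sqrt(sum(b * b for b in zb))
    return 1 - dot / norm


# Invented relative frequencies of four most frequent words in three texts.
texts = ["author_A_text1", "author_A_text2", "author_B_text1"]
freqs = [
    [0.060, 0.030, 0.020, 0.010],
    [0.058, 0.031, 0.021, 0.011],
    [0.040, 0.045, 0.010, 0.020],
]
z = z_scores(freqs)
same_author = (classic_delta(z[0], z[1]), cosine_delta(z[0], z[1]))
diff_author = (classic_delta(z[0], z[2]), cosine_delta(z[0], z[2]))
print(same_author, diff_author)
```

Comparing the distributions of same-author and different-author distances, rather than attribution success rates, only requires a distance table of this kind for each language.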

The results can be described as partially optimistic. The greater reliability of the latter distance measure (Evert et al. 2017) has been reconfirmed, as the two graphs show its tendency to better differentiate between same-author and different-author texts. More importantly, there is a correlation between Cosine Delta results for different-author texts (confirmed by highly significant outcomes of the Mantel matrix correlation test). This also shows that, with Cosine Delta and the different-author cohort, there is little difference in distance measure values that could be attributed to language alone.

However, results for texts written by the same authors suggest the impact of language on the ability of both Delta variants to group texts by their authors (although authorial variability within each culture’s literary language conventions may not be ruled out either). Median distance measure values (indicated in both graphs by the bold horizontal lines within each box) show that texts in Portuguese, French and English were guessed best with Cosine Delta at 100 most frequent words, and texts in Slovenian worst, also at 1000 most frequent words. Results for both lengths of the most-frequent-word list followed a similar pattern.

These results point to directions for ELTeC’s further evolution to increase its impact on our knowledge of how quantitative text analysis methods work across languages: the other dozen or so language sets may well be increased and a better representation of individual authors can be achieved.

References

  • Argamon, Shlomo (2008). “Interpreting Burrows’ Delta: Geometric and Probabilistic Foundations.” Literary and Linguistic Computing 23(2): 131-147.
  • Burrows, John (2002). “‘Delta’: A Measure of Stylistic Difference and a Guide to Likely Authorship.” Literary and Linguistic Computing 17(3): 267-287, 10.1093/llc/17.3.267.
  • Eder, Maciej, and Rybicki, Jan (2013). “Do birds of a feather really flock together, or how to choose test samples for authorship attribution,” Literary and Linguistic Computing 28 (2): 229-236.
  • Eder, Maciej, Rybicki, Jan, and Kestemont, Mike (2016). “Stylometry with R: A Package for Computational Text Analysis.” R Journal 8 (1): 107-121.
  • Evert, Stefan, Proisl, Thomas, Jannidis, Fotis, Reger, Isabella, Pielström, Steffen, Schöch, Christof, and Vitt, Thorsten (2017). “Understanding and explaining Delta measures for authorship attribution.” Digital Scholarship in the Humanities 32 suppl. 2: ii4-ii16.
  • Hoover, David L. (2004). “Testing Burrows’s Delta.” Literary and Linguistic Computing 19(4): 453–475.
  • Jockers, Matthew L. (2013). Macroanalysis – Digital Methods and Literary History. Champaign, IL: University of Illinois Press.
  • R Core Team (2021). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/.
  • Rybicki, Jan, and Eder, Maciej (2011). “Deeper Delta across genres and languages: do we really need the most frequent words?” Literary and Linguistic Computing 26(3):315-321.
  • Smith, Peter W. H., and Aldridge, W. (2011). “Improving Authorship Attribution: Optimizing Burrows’ Delta Method.” Journal of Quantitative Linguistics 18(1): 63-88.

Imagined differences: approaches to variation in fictional character voices in literary history

Authors

  • Artjoms Šeļa (1,2)
  • Joanna Byszuk (1)
  • Bartlomiej Kunda (1)
  • Laura Hernández-Lorenzo (1)
  • Botond Szemes (3)
  • Maciej Eder (1)

Institutions

  1. Institute of Polish Language, Polish Academy of Sciences (Kraków, Poland)
  2. University of Tartu (Estonia)
  3. Eötvös Loránd University (Hungary)

Abstract

Introduction

Since Vladimir Propp’s work, structural narratology has approached fictional characters mainly through their role or function – by what they do, or what is done to them (see Eder, Jannidis, Schneider (eds.), 2010). This character typology relied on recurring sets of functions of actors in the narrative (lover, villain, victim, detective, etc.) and the same perspective was often adopted in computational research, where novelistic characters were modeled on the basis of narrative, not dialogue, parts of text (Bamman et al. 2014; Bonch-Osmolovskaya & Skorinkin 2017; Underwood et al. 2018). There is a parallel way to understand a novel as a device for heteroglossia (following Bakhtin), a form that primarily models the conflict of ideas, discourses and social conditions. Dialogical exchange, “educated conversation” (Moretti 2013) and the clash of styles of reported discourse is central here: instead of how characters act, studies focus on what and how characters speak (Bronwen 2012; Culpeper 2001; Sternberg 1982). Available stylometric research on fictional speech suggests that characters within a text are often distinguishable by their local linguistic patterns without obscuring the global authorial trace (Burrows 1987; Burrows & Craig 2012; Hoover 2017). This, however, was often examined using dramatic material; a comparative glance into novels suggests variation in character distinctiveness across time and authors (Hoover et al. 2014).

We understand the heteroglossia of the novel not as its intrinsic feature, but as a tradition of the literary form that did not develop overnight, or develop equally across social, national and linguistic borders. There are many indirect cues that suggest that distinction in the voice of characters was a historical process: dialogue and dialogue-related features increased their presence in narrative prose over time (Muzny et al. 2017; Sobchuk 2016); quick expansion of third-person narration (Underwood et al. 2013) made it possible to restructure character networks (Elson et al. 2010), unlocking the “floating camera” and omnibus narratives. Influence of drama on speech organization in prose (Page 1988) might be another factor, as seen in Don Quixote, which set up the distinction in speech and social origins according to the styles of the classical tradition: sublime, medium and low (Close 1981).

This paper will present several ideas on measuring historical change and individual variation in characters’ reported linguistic diversity. We expect a general rise of speech distinction across 19th- and early 20th-century European novels. However, we don’t assume a uniform trend across national traditions and periods. Newly developed literary languages (like Czech or Latvian) might strive for speech homogeneity; a bourgeois novel might resist vernacular language, while popular literature may demonstrate more vertical sampling across social classes.

Materials and methods

Our primary source for examining variation in the speech of fictional characters is the European Literary Text Collection (ELTeC) corpora; we focus primarily on languages represented in Level 2 annotation (which includes information about named entities). We extract direct speech by means of a solution developed by us for a previous study (Byszuk et al. 2020), but coreference resolution and character identification across multiple languages would still prove challenging.

We will introduce two basic directions to approach character speech distinctiveness within one text: geometrical and predictive. The first is based on classic stylometry methods and depends on distance measures and points of reference within and outside the work (“context-hungry”). The second uses predictive modeling to recognize characters based on their speech samples and is restricted by the amount of available data (“content-hungry”).

There are a multitude of methodological problems we want to discuss with the Distant Reading community: from the ways of standardizing our inference across languages and time (e.g. distance baselines for different languages are likely to be different) to particular predictions for national traditions.

References

  • Bamman, D., Underwood, T., & Smith, N. A. (2014). A Bayesian Mixed Effects Model of Literary Character. Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 370–379. https://doi.org/10.3115/v1/P14-1035
  • Bonch-Osmolovskaya, A., & Skorinkin, D. (2017). Text mining War and Peace: Automatic extraction of character traits from literary pieces. Digital Scholarship in the Humanities, 32(suppl_1), i17–i24. https://doi.org/10.1093/llc/fqw052
  • Bronwen, T. (2012). Fictional Dialogue: Speech and Conversation in the Modern and Postmodern Novel. University of Nebraska Press.
  • Burrows, J., & Craig, H. (2012). Authors and Characters. English Studies, 93(3), 292–309. https://doi.org/10.1080/0013838X.2012.668786
  • Burrows, J. F. (1987). Computation into criticism: A study of Jane Austen’s novels and an experiment in method. Clarendon Press ; Oxford University Press.
  • Byszuk, J., Woźniak, M., Kestemont, M., Leśniak, A., Łukasik, W., Šeļa, A., & Eder, M. (2020). Detecting direct speech in 19th-century novels. LREC 2020.
  • Close, A. (1981). Characterization and Dialogue in Cervantes’s “Comedias en prosa.” The Modern Language Review, 76(2).
  • Culpeper, J. (2001). Language and characterisation: People in plays and other texts. Longman.
  • Eder, J., Jannidis, F., & Schneider, R. (Eds.). (2010). Characters in Fictional Worlds: Understanding Imaginary Beings in Literature, Film, and Other Media. De Gruyter. https://doi.org/10.1515/9783110232424.
  • Elson, D. K., Dames, N., & McKeown, K. R. (2010). Extracting Social Networks from Literary Fiction. Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, 138–147.
  • Hoover, D. L. (2014). The Moonstone and The Coquette: Narrative and Epistolary Style. In Digital Literary Studies. https://doi.org/10.4324/9780203698914-11.
  • Hoover, D. L. (2017). The microanalysis of style variation. Digital Scholarship in the Humanities, 32(suppl_2), ii17–ii30. https://doi.org/10.1093/llc/fqx022.
  • Moretti, F. (2013). Distant Reading. Verso.
  • Muzny, G., Algee-Hewitt, M., & Jurafsky, D. (2017). Dialogism in the novel: A computational model of the dialogic nature of narration and quotations. Digital Scholarship in the Humanities, 32(suppl_2), ii31–ii52. https://doi.org/10.1093/llc/fqx031.
  • Page, N. (1988). Speech in the English Novel. Macmillan.
  • Sobchuk, O. (2016). The Evolution of Dialogues: A Quantitative Study of Russian Novels (1830–1900). Poetics Today, 37(1), 137–154. https://doi.org/10.1215/03335372-3452643.
  • Sternberg, M. (1982). Proteus in Quotation-Land: Mimesis and the Forms of Reported Discourse. Poetics Today, 3(2), 107–156. https://doi.org/10.2307/1772069.
  • Underwood, T., Bamman, D., & Lee, S. (2018). The Transformation of Gender in English-Language Fiction. Journal of Cultural Analytics, 3(2), 11035. https://doi.org/10.22148/16.019.
  • Underwood, T., Black, M. L., Auvil, L., & Capitanu, B. (2013). Mapping mutable genres in structurally complex volumes. 2013 IEEE International Conference on Big Data, 95–103. https://doi.org/10.1109/BigData.2013.6691676.

10:15-11:15: Session 4, “Workflows and infrastructure requirements”

(1) Beyond Babylonian Confusion: a case study-based approach for multilingual NLP on historical literature
Tess Dejaeghere, Julie M. Birkholz, Els Lefever, Christophe Verbruggen

(2) Computational Literary Studies data landscape review and online catalogue
Ingo Börner, Vera Maria Charvat, Matej Ďurčo, Michał Mrugalski, Carolin Odebrecht

Session chair: Joanna Byszuk

Beyond Babylonian Confusion: a case study-based approach for multilingual NLP on historical literature

Authors

  • Tess Dejaeghere
  • Julie M. Birkholz
  • Els Lefever
  • Christophe Verbruggen

Institution

  • Ghent Centre for Digital Humanities, Ghent University

Keywords

  • Natural Language Processing, Digital Humanities, workflows, Named Entity Recognition, Sentiment Analysis, mixed-method approach

Abstract

Despite the fact that text is a core component of practices within both Natural Language Processing (NLP) and Digital Humanities (DH), a full-fledged cross-pollination between these two research fields remains pending. Differences in user cultures and end goals are at the root of this status quo: while NLP researchers adhere to rigid workflows to cater to linguistics-centered questions and to improve language models, DH scholars seek to answer meta-textual literary-historical questions in a heuristic framework (Blevins & Robichaud, 2011; Kuhn, 2019; Kuhn & Reiter, 2015; McGillivray et al., 2020; Suissa et al., 2020). Moreover, the level of technical expertise required to adapt NLP tools to literary-historical corpora is often perceived as a hurdle by DH scholars. Text corpora within the field of interest of (digital) humanities researchers are oftentimes riddled with complex characteristics such as character errors produced by Optical Character Recognition (OCR), language changes over time (i.e. concept drift) and multilinguality (Alex et al., 2016; Won et al., 2018). Powerful state-of-the-art NLP tools with potential in DH research settings, such as sentiment analysis (SA) and named entity recognition (NER), are most often trained on modern language data, making them less fit for application to historical corpora. The resulting high error rates deter digital humanists from integrating these tools in their workflows. Conversely, the limited availability of annotated domain-specific historical corpora impedes proper NLP tool development, leaving the application of NLP tools in heuristic research settings underexplored (McGillivray et al., 2020). Clearly, there is an urgent need for transparent, reproducible and durable workflows which are specifically tailored to the needs of DH scholars seeking to answer literary-historical questions using NLP techniques (Kuhn & Reiter, 2015; Parks & Peters, 2022).

When NLP tools are being integrated in DH research, the primary focus of the researchers is usually on the interpretation of the tool output to answer the corpus-specific question under consideration rather than on a thorough description of methodologies and tools used throughout the workflow. As a result, NLP workflows which are currently adopted in academic research papers are oftentimes obscure and highly context-dependent, making them difficult to reproduce (Parks & Peters, 2022).

With the aim of answering the call for transparent NLP-driven methodologies in DH and of fostering a tool-critical attitude among (digital) humanists, I will present ongoing research regarding a methodological workflow for implementing NLP in (digital) humanities research. Finally, suggestions will be made regarding 1) NLP tool choice across literary-historical corpus research settings, 2) interpretation of the tool output, and 3) ways of making the workflow and code accessible in a practical and structured manner to aid reproducibility across similar future research.

References

  • Alex, B., Grover, C., Klein, E., Llewellyn, C., & Tobin, R. (2016). Chapter 9—User-Driven Text Mining of Historical Text. In E. L. Tonkin & G. J. L. Tourte (Eds.), Working with Text (pp. 209–230). Chandos Publishing. https://doi.org/10.1016/B978-1-84334-749-1.00009-3.
  • Blevins, C., & Robichaud, A. (2011). 2: A Brief History » Tooling Up for Digital Humanities. Tooling Up for Digital Humanities. http://toolingup.stanford.edu/?page_id=197.
  • Kuhn, J. (2019). Computational text analysis within the Humanities: How to combine working practices from the contributing fields? Language Resources and Evaluation, 53(4), 565–602. https://doi.org/10.1007/s10579-019-09459-3.
  • Kuhn, J., & Reiter, N. (2015). A Plea for a Method-Driven Agenda in the Digital Humanities.
  • McGillivray, B., Poibeau, T., & Fabo, P. R. (2020). Digital Humanities and Natural Language Processing: Je t’aime… Moi non plus. Digital Humanities Quarterly, 014(2).
  • Parks, L., & Peters, W. (2022). Natural Language Processing in Mixed-methods Text Analysis: A Workflow Approach. International Journal of Social Research Methodology, 0(0), 1–13. https://doi.org/10.1080/13645579.2021.2018905.
  • Suissa, O., Elmalech, A., & Zhitomirsky-Geffet, M. (2020). Text analysis using deep neural networks in digital humanities and information science. Journal of the Association for Information Science and Technology, n/a(n/a). https://doi.org/10.1002/asi.24544.
  • Won, M., Murrieta-Flores, P., & Martins, B. (2018). Ensemble Named Entity Recognition (NER): Evaluating NER Tools in the Identification of Place Names in Historical Corpora. Frontiers in Digital Humanities, 5, 2. https://doi.org/10.3389/fdigh.2018.00002.

Computational Literary Studies Data Landscape Review and Online Catalogue

Authors

  • Ingo Börner (1)
  • Vera Maria Charvat (2)
  • Matej Ďurčo (2)
  • Michał Mrugalski (3)
  • Carolin Odebrecht (3)

Institutions

  1. University of Potsdam
  2. Austrian Academy of Sciences, Austrian Centre for Digital Humanities and Cultural Heritage (ACDH-CH)
  3. Humboldt-University of Berlin

Keywords

  • Computational Literary Studies, Metadata, Linked Open Data, Data Landscape, Research Discovery, FAIR principles

Abstract

Literary works and their digital representations provide the foundation for epistemic debates and discourses in the field of Computational Literary Studies (CLS). The various forms of processing and visualization of literary works of all genres (epic, drama, lyric), such as digital editions (Sahle 2013) or network analysis (Trilcke 2013), generate a multitude of heterogeneous data that are interconnected and linked with each other in an increasingly flexible and comprehensive way. This development brings the question of data interoperability into focus, with Linked Open Data (LOD; Heath & Bizer 2011) playing a central role. In our CLS INFRA work packages 5 “Issues of data curation and selection” and 6 “Consolidating and preparing data for CLS”, we aim to systematically compile, curate, and consolidate information about literary works for the CLS community in order to significantly enhance access paradigms for literary data and to improve adherence to FAIR principles (findable, accessible, interoperable, reusable; Wilkinson et al. 2016).

To enhance findability and research-oriented access to literary data for the CLS community, an inventory of the CLS data landscape is needed which applies research-relevant criteria to data selection, acquisition, and description. This inventory, which we conduct in the form of a Data Landscape Review and technically realize as an online Linked Open Data Catalog, is the first step to comprehensively illustrate the existing data landscape as digital heritage for CLS contexts and make it accessible as a foundation for further research.

The inventory will provide a comprehensive overview of the available resources, including detailed descriptive metadata, and will offer rich querying and indexing capabilities through various search and filtering mechanisms. The conceptual starting point for the structured collection of information is the Metamodel for Corpus Metadata (MKM; Odebrecht 2018) – a generic extensible description model accounting for the central entities corpus, document, and annotation, as well as their relationships to each other.

While the model itself is defined abstractly, we elaborate a corresponding mapping realized as an extension to CIDOC CRM, which allows a representation of the data in RDF (Resource Description Framework). Furthermore, the formalization also makes it possible to explicitly express semantic equivalences to already existing ontologies and schemas that adhere to the LOD paradigm. In particular, approaches to text and publication classification such as FRBR (IFLA, 1998) and Dublin Core (ISO standard 15836) should be mentioned here. In addition to equivalences on the schema level, the dataset is enriched with links to external reference resources such as the authority files GND (Gemeinsame Normdatei), VIAF (Virtual International Authority File), Wikidata and GeoNames. These are indispensable to establish semantic interoperability between datasets. The Data Landscape Review as well as the online catalog will give researchers access to a wide range of resources distributed across multiple European providers. It will also offer a comprehensive, domain-specific overview of these resources, including descriptions and further information that will facilitate the findability and accessibility of literary texts.

Links

  • CLS INFRA: https://clsinfra.io/
  • CIDOC CRM: https://cidoc-crm.org
  • RDF W3C Recommendation: https://www.w3.org/TR/rdf-primer/
  • GND: https://gnd.network
  • VIAF: http://viaf.org/
  • Wikidata: https://www.wikidata.org/
  • GeoNames: http://www.geonames.org/

References

  • Fischer, Frank / Börner, Ingo / Göbel, Mathias / Hechtl, Angelika / Kittel, Christopher / Milling, Carsten / Trilcke, Peer (2019): “Programmable Corpora: Introducing DraCor, an Infrastructure for the Research on European Drama”, in: Fischer, Frank / Akimova, Marina / Orekhov, Boris (eds.): Digital Humanities 2019. Conference Abstracts, Utrecht University, Moscow, https://dev.clariah.nl/files/dh2019/boa/0268.html.
  • Heath, Tom / Bizer, Christian (2011): “Linked Data: Evolving the Web into a Global Data Space”, 1st edition, in: Synthesis Lectures on the Semantic Web: Theory and Technology, Vol. 1, No. 1 [San Rafael, Calif.]: Morgan & Claypool, pp. 1-136, doi: https://doi.org/10.2200/S00334ED1V01Y201102WBE001.
  • IFLA (1998): “Functional Requirements for Bibliographic Records: Final Report”, in: IFLA Series on Bibliographic Control 19 (former UBCIM). München: K.G. Saur Verlag.
  • ISO standard 15836 (2017): “The Dublin Core Metadata Element Set”.
  • Odebrecht, Carolin (2018): “MKM – ein Metamodell für Korpusmetadaten. Dokumentation und Wiederverwendung historischer Korpora”, Dissertation. Humboldt-Universität zu Berlin, Sprach- und literaturwissenschaftliche Fakultät, Berlin. doi: https://doi.org/10.18452/19407.
  • Sahle, Patrick (2013): “Digitale Editionsformen, Zum Umgang mit der Überlieferung unter den Bedingungen des Medienwandels”, 3 vols., Norderstedt: Books on Demand, in: Schriften des Instituts für Dokumentologie und Editorik, vols. 7-9.
  • Trilcke, Peer (2013): “Social Network Analysis (SNA) als Methode einer textempirischen Literaturwissenschaft”, in: Ajouri, Philip / Mellmann, Katja / Rauen, Christoph (eds): Empirie in der Literaturwissenschaft. Paderborn: Mentis 201–247.
  • Wilkinson, Mark D. / Dumontier, Michel / Aalbersberg, IJsbrand J. / et al. (2016): “The FAIR Guiding Principles for scientific data management and stewardship”. Sci Data 3, 160018. doi: https://doi.org/10.1038/sdata.2016.18.

11:45-13:15: Session 5a, “Beyond ELTeC texts”

(1) What’s in a preface? Sentiment analysis of liminal matter in ELTeC collections
Rosario Arias, Javier Fernández-Cruz, Ioana Galleron, María García-Gámez, Frédérique Mélanie-Becquet, Roxana Patras, Chantal Pérez-Hernández, Olga Seminck

(2) To catch a protagonist … once again. An attempt to recreate a corpus-based study using Linked Data
Ingo Börner, Peer Trilcke, Frank Fischer, Carsten Milling, Henny Sluyter-Gäthje

(3) – Feel free to switch to the parallel session, or start lunch break early 😉

Session chair: Christof Schöch

What’s in a preface? Sentiment analysis of liminal matter in ELTeC collections

Authors

  • Rosario Arias (1)
  • Javier Fernández-Cruz (1,2)
  • Ioana Galleron (3)
  • María García-Gámez (1)
  • Frédérique Mélanie-Becquet (4)
  • Roxana Patras (5)
  • Chantal Pérez-Hernández (1)
  • Olga Seminck (4)

Institutions

  1. University of Málaga
  2. University of Bergamo
  3. Sorbonne-Nouvelle
  4. LATTICE
  5. ‘Alexandru Ioan Cuza’ University of Iasi

Keywords

  • paratext, preface; liminal material; ELTeC; sentiment analysis

Abstract

This presentation aims to analyze liminal divisions in four collections of ELTeC (Spanish, French, Romanian, and English) and to propose a functional definition of the introductory units based on their sentiment profiles. In the first part, we briefly introduce theories of prefaces as paratext and emphasize the fact that their repertory of devices has been estimated as being much more stable than readers and authors themselves would believe (Genette 1987, 161-292; Peikola & Bös 2020, 3-33; Rolls & Barcan 2011, 13-26, 137-157; Pellatt 2013, 3-33). Besides a predictable repertory of tropes (mentioned in all treatises of rhetoric), prefaces are related to “moments”, thus to pathetic “intensity”, a feature that opens up a path for SA explorations.

After the theoretical remarks, we explain the preprocessing steps such as:

  1. distinction between author, editor and allographic prefaces;
  2. distinction among original, later, and delayed prefaces;
  3. separation of a novel’s various introductory units in distinct files;
  4. the lower limit of words for the textual units considered for analysis;
  5. the cleaning-up of data (normalization and translation where necessary).

In the second part, we link types of liminal divisions (prefaces, epigraphs, dedications, and others) with ELTeC metadata, so as to see if gender, time period, length or reprint count have an influence on the number and type of liminal texts. Even if this paper is based on a more limited number of collections, we are particularly interested in determining whether, as in the case of ELTeC titles (Patras et al., 2021), there are zones of cultural influence.

In the third part, we focus on the prefaces, because they are the longest amongst liminal units. Rather than listing their pragmatic functions and describing their “empirical historicity” within a close reading approach (Genette 1987, 162), we use sentiment analysis tools to see if their global “sentiment outline” can be related to momentous intensity and to specific tasks (Barros et al 2013; Evgeny et al 2019; Harshita & Mirza 2018) such as previewing/summarizing, commenting/explaining, and contextualizing the novels. Unfortunately, no tool actually supports all four languages we intend to work on. To overcome this difficulty, we first use two different tools, Lingmotif (www.lingmotif.com) and the Python library TextBlob, on the English and Spanish collections (both languages are supported by both tools). We compare the results, then apply TextBlob to the French texts and Lingmotif to translations of the Romanian texts (produced with DeepL, https://www.deepl.com/translator, then checked with Rowordnet sentiment attributes, http://dcl.bas.bg/bulnet/).

While this method has several disadvantages (use of different tools and of translated texts), it also has the advantage of allowing us both to sketch a new definition of the prefatory genre and to test the robustness of a “combine-and-adapt” approach, which is (and will be for some more time) the day-to-day reality of computational literary studies.

References

  • Barros, Linda, et al. “Automatic Classification of Literature Pieces by Emotion Detection: A Study on Quevedo’s Poetry”. 2013 Humaine Association Conference on Affective Computing and Intelligent Interaction, IEEE, (2013), pp. 141–46, https://doi.org/10.1109/ACII.2013.30.
  • Birke, D., & Christ, B. “Paratext and digitized narrative: Mapping the field”. Narrative 21.1(2013): 65–87. doi:10.1353/nar.2013.0003.
  • Evgeny Kim, Roman Klinger. “A Survey on Sentiment and Emotion Analysis for Computational Literary Studies”. In: Zeitschrift für digitale Geisteswissenschaften. First published 16.12.2019. Version 2.0 of 23.07.2021. Wolfenbüttel 2021. text/html format. DOI: 10.17175/2019_008_v2.
  • Genette, Gerard. Palimpsests: Literature in the Second Degree. Trans. by C. Newman and C. Doubinsky. U of Nebraska P, 1997.
  • Genette, Gerard. Paratexts: Thresholds of Interpretation. Trans. by Jane E. Lewin. Foreword by Richard Macksey. Cambridge UP, 2010.
  • Harshita, Jhavar and Paramita Mirza. “EMOFIEL: Mapping Emotions of Relationships in a Story”. In WWW ’18 Companion: The 2018 Web Conference Companion, April 23–27, 2018, Lyon, France. ACM, New York, NY, USA, 4 pages. https://doi.org/10.1145/3184558.3186989.
  • Henny-Krahmer, Ulrike Edith Gerda. “Exploration of Sentiments and Genre in Spanish American Novels”. Digital Humanities 2018: Puentes-Bridges (DH 2018) (2018): 399–403.
  • Jacobs AM. “Sentiment Analysis for Words and Fiction Characters From the Perspective of Computational (Neuro-)Poetics”. Front. Robot. AI 6:53 (2019). doi: 10.3389/frobt.2019.00053.
  • Maclean, Marie. “Pretexts and Paratexts. The Art of the Peripheral”. New Literary History 22.2 (1991): 273-79. https://doi.org/10.2307/469038.
  • Moreno-Ortiz, A. & Pérez-Hernández, C. Lingmotif-lex: a Wide-coverage, State-of-the-art Lexicon for Sentiment Analysis. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). Miyazaki, Japan. May 2018. 2653-2659.
  • Moreno-Ortiz, A. Lingmotif: Sentiment Analysis for the Digital Humanities. Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics. Valencia, Spain (2017). Association for Computational Linguistics. 73-76.
  • Patras, Roxana, Carolin Odebrecht, Ioana Galleron, Rosario Arias, J. Berenike Herrmann, Cvetana Krstev, Katja Mihurko Poniž & Dmytro Yesypenko,“Thresholds to the ‘Great Unread’: Titling Practices in Eleven ELTeC Collections”. Interférences littéraires /Literaire interferenties. dir. Chris Tanasescu 25 (2021): 163-187.
  • Peikola, Matti, Birte Bös, eds. The Dynamics of Text and Framing Phenomena. John Benjamins Publishing Company, 2020.
  • Pellatt, Valerie, ed. Text, Extratext, Metatext and Paratext in Translation. Cambridge Scholars Publishing, 2013.
  • Rolls, Alistair, Marie-Laure Vuaille-Barcan, eds. Masking Strategies Unwrapping the French Paratext. Peter Lang, 2011.
  • Samothrakis, Spyridon, and Maria Fasli. “Emotional Sentence Annotation Helps Predict Fiction Genre”. PLOS ONE, edited by Zhaohong Deng, 10.11 (2015):e0141922, https://doi.org/10.1371/journal.pone.0141922.
  • Skare, Roswitha. “Paratext – a Useful Concept for the Analysis of Digital Documents?” Proceedings from the Document Academy 6.1 (2019): 1-12. DOI: 10.35492/docam/6/1/12.
  • Taboada, Maite, et al. “Sentiment Classification Techniques for Tracking Literary Reputation”. LREC Workshop: Towards Computational Models of Literary Analysis (2006): 36–43.

To Catch a Protagonist … Once Again. An Attempt to Recreate a Corpus-Based Study Using Linked Data

Authors

  • Ingo Börner (1)
  • Peer Trilcke (1)
  • Frank Fischer (2)
  • Carsten Milling (1)
  • Henny Sluyter-Gäthje (1)

Institutions

  1. University of Potsdam, Germany
  2. Freie Universität Berlin, Germany

Keywords

  • Drama Analysis, Reproducibility, Data Modelling, Linked Data

Abstract

Our paper “To Catch a Protagonist … Once Again” reports on a case study in which we repeated a prior experiment to identify quantitatively dominant characters of a play based on network and count-based measures. We re-implemented the algorithm to replicate the original study and, by following a “method as a microservice” approach, developed a self-describing service to reproduce the results with plays taken from the German Drama Corpus (GerDraCor) of the larger DraCor project. To formally describe the whole research process and its results, we developed a linked data model based on the CIDOC CRM and FRBRoo ontologies. We will demonstrate a digital research environment based on the semantic web publishing tool Research Space to analyze and evaluate the results of this study.

We report on a case study that tries to replicate and reproduce (for the conceptual differentiation between “replication” and “reproduction”, see Schöch 2021) the results of a corpus-based study by making use of linked data technology. The original paper, To Catch a Protagonist: Quantitative Dominance Relations in German-Language Drama (Fischer, Trilcke, Kittel et al. 2018), describes an algorithm that identifies the quantitatively dominant characters of a play based on a set of network-based measures (degree, weighted degree, betweenness centrality, closeness centrality, eigenvector centrality) and count-based measures (number of scenes a character appears in, number of speech acts, number of spoken words). The algorithm’s original implementation in the tool Dramavis v0.4 (Fischer/Kittel 2017) was tested on a corpus of 465 German-language dramas developed in the DLINA project (Fischer/Trilcke 2015), but only the results for five exemplary plays are included in the corresponding conference paper. The process data of the remaining plays has only recently been made available (see https://github.com/dlina/catch-protagonist).
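To illustrate how several network-based and count-based measures might be combined into a single ranking, here is a minimal sketch that averages each character's rank across all measures. This mean-rank aggregation is an illustrative stand-in, not the algorithm published in the original study or implemented in Dramavis; the function name, the character names, and all values are hypothetical.

```python
def mean_rank(measures):
    """Average each character's rank over all measures.

    `measures` maps a measure name to a {character: value} dict, where a
    higher value is assumed to mean "more dominant". Rank 1 is the top
    position; ties are broken by insertion order (sorted() is stable).
    """
    ranks = {}
    for values in measures.values():
        ordered = sorted(values, key=values.get, reverse=True)
        for rank, character in enumerate(ordered, start=1):
            ranks.setdefault(character, []).append(rank)
    return {character: sum(r) / len(r) for character, r in ranks.items()}


# Hypothetical values for three characters and two measures.
measures = {
    "degree": {"A": 3, "B": 2, "C": 1},
    "spoken_words": {"A": 100, "B": 300, "C": 50},
}
scores = mean_rank(measures)
# Characters sorted from most to least dominant (lowest mean rank first).
ranking = sorted(scores, key=scores.get)
```

A lower mean rank indicates a character who is placed near the top by many measures at once, which is the intuition behind a "quantitatively dominant" character.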

To first re-use the “by-products” of the original study, we developed a data model based on the CIDOC CRM ontology (CIDOC CRM Special Interest Group 2021) and its extension CRMdig (Doerr et al. 2016). The aim is to make the conducted research steps transparent and possibly link them to external concepts (e.g. the Wikidata representation of a network measure). By adhering to CIDOC CRM’s modeling principles, we model the research process as a series of activities (e.g. crm:E7_Activity; crmdig:D7_Digital_Machine_Event) that make use of tools (crmdig:D14_Software), operate on data (crm:E73_Information_Object; crmdig:D1_Digital_Object) and produce “statements” about the analyzed entities relevant to literary studies – in our case characters in a drama. The values that are calculated within these processes are linked to the entities as so-called “Attribute Assignments” (crm:E13_Attribute_Assignment), thus making transparent on which facts and by whom an assertion about an entity was made. Although CIDOC CRM’s event-based approach results in rather complex RDF constructs, it has one main advantage: the Knowledge Graph as a registry holds not only the information on the results but also a detailed description of the conducted actions and the analysis workflow.
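The attribute-assignment pattern described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' actual data model: the `https://example.org/` URIs, the function names, and the sample value are hypothetical, and the calculated value is written as a plain typed literal for brevity, whereas the real model may represent it differently. Only the CIDOC CRM class E13_Attribute_Assignment and the properties P140, P141, P177 and P14 are taken from the ontology itself.

```python
CRM = "http://www.cidoc-crm.org/cidoc-crm/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
XSD_DOUBLE = "http://www.w3.org/2001/XMLSchema#double"
EX = "https://example.org/"  # hypothetical base URI, for illustration only


def attribute_assignment(assignment_id, character, measure, value, agent):
    """Describe one calculated value as an E13 Attribute Assignment."""
    node = EX + "assignment/" + assignment_id
    return [
        (node, RDF_TYPE, CRM + "E13_Attribute_Assignment"),
        # Which entity the statement is about (here: a character).
        (node, CRM + "P140_assigned_attribute_to", EX + "character/" + character),
        # Which kind of attribute was assigned (here: a measure).
        (node, CRM + "P177_assigned_property_of_type", EX + "measure/" + measure),
        # The assigned value itself (simplified to a literal).
        (node, CRM + "P141_assigned", value),
        # Who or what made the assertion.
        (node, CRM + "P14_carried_out_by", EX + "agent/" + agent),
    ]


def to_ntriples(triples):
    """Serialize (subject, predicate, object) tuples as N-Triples lines."""
    lines = []
    for s, p, o in triples:
        if isinstance(o, float):
            obj = f'"{o}"^^<{XSD_DOUBLE}>'
        else:
            obj = f"<{o}>"
        lines.append(f"<{s}> <{p}> {obj} .")
    return "\n".join(lines)


# Hypothetical example: a degree value calculated for one character.
triples = attribute_assignment("1", "franz_moor", "degree", 0.75, "microservice")
ntriples = to_ntriples(triples)
```

Because each value sits behind its own assignment node, the same character can carry competing values from different runs or tools, each traceable to the agent that produced it.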

In the case of the replication of the DLINA study the “knowledge” about the characters was extracted from the process files with a Jupyter notebook and transformed to RDF. We then re-implemented the original algorithm (Börner 2021) following a “method as a microservice” approach and used our new service’s API to re-calculate the measures based on the original data. We used only the supplied network and count-based measures as input and had the calculations of the rankings done by the microservice, which in turn also produced a detailed RDF description of the whole analysis process according to the above-mentioned model. The generated assertions about the characters then allowed us to test our implementation by comparing them to the results of the character rankings produced in the original study from 2018.

Aiming at a reproduction of the original experiment, we also analyzed the corresponding dramas in GerDraCor, the programmable corpus of German-language plays included in the DraCor platform (https://dracor.org/ger). All results are ingested into a triple store and can be jointly analyzed and compared by using SPARQL queries. The triple store is attached to the “Corpora Exploration Platform” based on the tool Research Space (The British Museum 2021) developed within the CLS INFRA project. Within this environment, source data (corpora, process files) and the research in Computational Literary Studies based on them can now be linked together and jointly analyzed.

References

  • Börner, Ingo. 2021. “To catch a protagonist in DraCor”. https://github.com/dracor-org/dracor-notebooks/blob/main/catch-a-protagonist-in-dracor/catch-a-protagonist-in-dracor.ipynb.
  • CIDOC CRM Special Interest Group. 2021. “Definition of the CIDOC Conceptual Reference Model. Version 7.1.1”. https://doi.org/10.26225/FDZH-X261
  • Doerr, Martin et al. 2016. “Definition of the CRMdig. An Extension of CIDOC-CRM to support provenance metadata. Version 3.2.1”. https://cidoc-crm.org/crmdig/sites/default/files/CRMdig_v3.2.1.pdf
  • Fischer, Frank, Peer Trilcke, Christopher Kittel, Carsten Milling, and Daniil Skorinkin. 2018. „To Catch a Protagonist: Quantitative Dominance Relations in German-Language Drama (1730–1930)“. In DH2018: »Puentes/Bridges«. 26–29 June 2018. Book of Abstracts / Libro de resúmenes. Mexico: Red de Humanidades Digitales A. C. https://dh2018.adho.org/to-catch-a-protagonist-quantitative-dominance-relations-in-german-language-drama-1730-1930.
  • Fischer, Frank, Christopher Kittel. 2017. “Dramavis v0.4”. https://github.com/lehkost/dramavis.
  • Fischer, Frank, Peer Trilcke. 2015. “Introducing DLINA Corpus 15.07 (Codename: Sydney)”. https://dlina.github.io/Introducing-DLINA-Corpus-15-07-Codename-Sydney.
  • Schöch, Christof. 2021. “A Typology of Reproducible Research: Concepts, Terms, Examples”. https://dh-trier.github.io/trr
  • The British Museum. 2021. Research Space. https://researchspace.org.

11:45-13:15: Session 5b, “Distant Reading”

(1) Distant Reading for European Literary History: ELTeC, digital sources and digital archives for studying Romanian literature
Luiza Catrinel Marinescu

(2) Combining close and distant reading: A plausible way forward?
Meliha Handzic, Vedad Mulavdic

(3) Beginning with the age-old challenges. Building a didactic resource for digital literature studies
Mads Rosendahl Thomsen

Session chair: Carolin Odebrecht

Distant Reading for European Literary History: ELTeC, Digital Sources and Digital Archives for Studying Romanian Literature

Author

  • Luiza Catrinel Marinescu

Institution

  1. Romanian Language Institute, Bucharest / University St. Kliment Ohridsky, Sofia

Abstract

Uncovering the patterns and unspoken rules behind literature from a very technical perspective, distant reading in Romanian literature has gained an important impulse through the efforts and work of the period 2000-2022.

Many Romanian novels from the second half of the 19th century and the first quarter of the 20th century are rarely (if ever) read any more. In recent years, comparative studies of European literature have gained a Romanian term of comparison regarding the evolution of the novel, which is considered to be an indicator of evolution, civilization and culture.

The paper aims to underline the importance of ELTeC, digital sources and digital archives for drawing the whole picture of what Romanian literature represents in the context of European novel writing. This area of research has an immediate relevance from the perspective of theory, technology, methodology and application, as the use of the corpora and resources has given a new perspective on reading Romanian literature, from both close and distant perspectives.

Understanding distant reading as a technological part of the digital humanities and as a conversation between literary studies and comparative literature, the COST Action Distant Reading for European Literary History offers new inspiration to many scholars: augmented editions and fluid textuality; comparisons of distant/close, macro/micro, and surface/depth; cultural analytics, aggregation, and data mining; visualization and data design; locative investigation and thick mapping; structured archives; and distributed knowledge production and performative access, all of which help to better understand the macroscopic changes in cultural trends.

References

  • ELTeC-rom (distantreading.github.io)
  • Muzeul Digital al Romanului Românesc | Revista Transilvania
  • Institutului Naţional al Patrimoniului Direcția Patrimoniului Digital (cimec.ro)
  • Biblioteca Centrala Universitara „Carol I” (bcub.ro)
  • Biblioteca Centrală Universitară “Lucian Blaga” Cluj-Napoca | (bcucluj.ro)
  • Biblioteca Centrală Universitară ‘M.Eminescu’ Iași (bcu-iasi.ro)
  • Biblioteca Digitală a Bucureştilor (digibuc.ro)

Combining close and distant reading: A plausible way forward?

Authors

  • Meliha Handzic (1)
  • Vedad Mulavdic (2)

Institutions

  1. International Burch University, Bosnia and Herzegovina
  2. University of Sarajevo, Bosnia and Herzegovina

Keywords

  • Distant reading, close reading, literary analysis, case study

Abstract

Close and distant reading are two widely popular terms used to describe traditional and novel approaches to analysing literary texts. Essentially, close reading involves deep examination of central text components by human readers. In contrast, distant reading, a term originally coined by Moretti (2000), generally refers to the use of computational methods to analyse literary texts. Recently, there have been calls to move beyond the limitations of binary oppositions between close and distant reading (Taylor et al. 2018) towards combined models that exploit aspects of both approaches for the purpose of enabling richer literary research.

In this study, we responded to the above calls by devising and applying a specific combined reading approach to analyse the behaviour of major characters across the text of a Bosnian 19th century novel Zeleno busenje (Green Turf) by Edhem Mulabdic (https://archive.org/details/MulabdiEdhemZelenoBusenje). It was hoped that the synergy between close and distant reading would help us to better address socio-psychological aspects of Bosnian society during the Austro-Hungarian occupation of Bosnia.

We started the analysis process with annotation of the novel characters and creation of a node for each character in the novel. Then we created an edge for each pair of characters that co-occurred at least once on one page. Next, we counted all these co-occurrences in order to capture a wide variety of types of interactions and associations between characters. Finally, in accordance with a distant reading approach, we used Gephi software for social network analysis to uncover key players, their ties, strength and cohesion within the network. We used the degree centrality measure to identify the most important characters in the novel. Then, we extracted and partitioned the network into cohesive subgroups using a modularity algorithm.
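The network construction described above can be sketched in a few lines. This is a minimal illustration, assuming the page-level annotations are available as lists of character names; the function names and the sample data are hypothetical, and the authors' own analysis was carried out in Gephi, which also provides the modularity partitioning that this sketch does not cover.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(pages):
    """Count, for each pair of characters, the pages on which both appear."""
    edges = Counter()
    for characters_on_page in pages:
        # set() ensures a character mentioned twice on a page counts once.
        for a, b in combinations(sorted(set(characters_on_page)), 2):
            edges[(a, b)] += 1
    return edges


def degree_centrality(edges):
    """Normalised degree: distinct neighbours divided by (n - 1).

    Characters that never co-occur with anyone are not part of the network.
    """
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    n = len(neighbours)
    if n < 2:
        return {character: 0.0 for character in neighbours}
    return {character: len(ns) / (n - 1) for character, ns in neighbours.items()}


# Hypothetical annotations: one list of character names per page.
pages = [
    ["A", "B"],
    ["A", "B", "C"],
    ["C", "D"],
    ["A"],
]
edges = cooccurrence_edges(pages)
centrality = degree_centrality(edges)
```

The edge counts serve as weights for tie strength, while degree centrality ignores the weights and only asks how many different characters each character shares a page with.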

We followed this by close reading of pages which mentioned these subgroups to learn more about the nature and significance of their membership. Our initial visualisation showed no single key figure, but rather a group of tightly interconnected characters. From human reading of the text, we learned that these characters are tightly connected members of two different families. More importantly, they are typical representatives of the social fabric of the time: liberals and conservatives. We also learned that on an individual level, they are connected by friendship and love interests. However, this idyllic picture of family life was interrupted by occupation and change. Modularity statistics used to examine the structure of the network identified three clusters. Close reading of relevant text passages indicated existing tension between individual and collective issues that split families. Thus, some members sacrificed the individual (love) for the collective (duty) by joining the futile resistance and facing inevitable death, while others adopted a pragmatic view and tried to adapt to the new reality. In short, these results provide empirical evidence that confirms our earlier presented novel review (https://www.distant-reading.net/zeleno-busenke/).

Overall, the above case study implies that the proposed combined reading approach to literary analysis can help alleviate the weakness of distant reading due to lack of context, while at the same time improving efficiency and effectiveness of the analysis process by directing the researcher’s attention to aspects that require deeper text analysis. More generally, it suggests that a combination of different reading techniques and tools can provide valuable support for complex interpretation of a literary work.

References

  • Moretti F. (2000), Conjectures on World Literature, New Left Review, 1 (2000), retrieved from https://newleftreview.org/II/1/franco-moretti-conjectures-on-world-literature.
  • Taylor J.E., Gregory I.N. and Donaldson C. (2018), Combining Close and Distant Reading: A Multiscalar Analysis of the English Lake District’s Historical Soundscape, International Journal of Humanities and Arts Computing, 12(2), 163-182. DOI: 10.3366/ijhac.2018.0220

Beginning with the age-old challenges. Building a didactic resource for digital literature studies

Author

  • Mads Rosendahl Thomsen

Institution

  1. Aarhus University

Abstract

Computational approaches to literary studies have gained a lot of traction in recent years and the continuing digitization of texts makes it difficult not to see digital methods as one of the essential approaches that any student of literature should learn about. However, most students will not be inclined to take up advanced computational studies nor will they have the needed resources in their study programs to do so. Many students will also feel alienated from computational approaches to literature given that reading books was likely the main attraction for their choice of study.

Often students are introduced to tools that have much versatility but receive little instruction on how to integrate them with the non-digital methods and approaches they have been trained in. There is also a lack of a systematic overview of how computational approaches can assist in solving questions of literary studies as well as where they may come up short.

This paper will introduce and reflect on the site “Digital Literary Studies – A Companion Guide” that has been edited by Mads Rosendahl Thomsen with contributions from primarily Telma Peura and Emma Risgaard Olsen along with a dozen literary scholars. The site was developed on the themes of an existing introduction to literary studies, Literature: An Introduction to Theory and Analysis, using the 33 subjects of that book to a) reflect on how computational approaches could be applied and to what extent, b) suggest concrete ways to engage with literature, both for beginners and advanced users, c) refer to relevant programs and code, and d) list relevant articles for each subject that reflect the state of the art.

The motivation behind the site is to engage more scholars in reflecting on the uses of computational methods and enable students to enter the field at the competency level they are at, and give them more options to engage with quantitative methods.

The thesis of this paper is that in order to change the field of literary studies and integrate new methods, there should be more investment in how to enable the 95 % of scholars that will see digital methods as a supplement to their practice rather than their core approach. In addition to discussing the design of the resource-site and its pros and cons, I will also welcome suggestions for adding to the site and improving it.

References

  • Peura, Telma et al., Digital Literary Studies – A Companion Guide. Web: https://litdh.au.dk/. Launched May 2020.
  • Thomsen, Mads Rosendahl et al, ed., Literature: An Introduction. London: Bloomsbury, 2018.

14:15-15:45: Closing ceremony with closing keynote

Closing Keynote: Dr. Oleg Sobchuk (Max Planck Institute for the Science of Human History, Jena, Germany): “Distant Reading and Cultural Evolution: A Method Meets a Theory”

Ceremonial handover to CLS INFRA

Session chair: Maciej Eder