The COST Action Distant Reading for European Literary History is organizing a Training School in Named Entity Recognition & Geo-Tagging for Literary Analysis, co-located with the ICARUS conference in Rijeka.

Key information

• Date: 22-25 March 2021

• Place: Online via Zoom

• Trainers: Carmen Brando (School for Advanced Studies in the Social Sciences), Francesca Frontini (Institute for Computational Linguistics in Pisa), Ioana Galleron (Sorbonne-Nouvelle University), Jessie Labov (Central European University; Eötvös Loránd Research Network), Benedikt Perak (University of Rijeka), Maciej Piasecki (Wrocław University of Science and Technology), Diana Santos (University of Oslo), Ranka Stanković (University of Belgrade), Wiktor Walentynowicz (Wrocław University of Science and Technology), Tomasz Walkowiak (Wrocław University of Science and Technology)

• Host institution: Faculty of Humanities and Social Sciences, University of Rijeka, Croatia

• Organizers: Antonija Primorac, Working Group 3 Leader (antonija.primorac@uniri.hr), Joanna Byszuk, Working Group 2 Leader (joanna.byszuk@ijp.pan.pl), George Mikros, Working Group 2 Co-Leader (gmikros@gmail.com)

• Contact persons: Dmytro Yesypenko, Training School Coordinator (dm.yesypenko@gmail.com), Christof Schöch, Action Chair (schoech@uni-trier.de), Antonija Primorac, Working Group 3 Leader (antonija.primorac@uniri.hr), Joanna Byszuk, Working Group 2 Leader (joanna.byszuk@ijp.pan.pl)

• There is no fee for participation

Background information

Training School Tracks

Workshop 1: Introduction to Named Entity Recognition

22-23 March 2021, 9-15 CET (with lunch break 11.30-12.30)

Coordination: Joanna Byszuk

Trainers: Carmen Brando, Francesca Frontini, Ioana Galleron, Maciej Piasecki, Diana Santos, Ranka Stanković, Wiktor Walentynowicz, Tomasz Walkowiak

This 2-day workshop will introduce the task of Named Entity Recognition (NER) and describe several annotation guidelines and campaigns. The practical part will cover a) basic manual annotation with a selection of tools such as BRAT, INCEpTION and Recogito, and the analysis of disagreement among annotators, b) automatic annotation with easy-to-use tools such as the CLARIN-PL NER tool suite, c) TEI-encoding of NER annotation, and d) practical exercises in analysing NE contexts with respect to description, sentiment and perception. Some of the practical exercises will focus on English, so that everyone works on the same texts and the procedures are easier to follow. The workshop is nevertheless addressed to speakers of all ELTeC languages, so that they can learn enough about NER to work on their own collections; examples from other languages will therefore also be presented.

Detailed program:

22 March (Monday)
1° 9.00-11.30
A presentation about NER in general
with a break 10.00-10.15
Diana Santos, Carmen Brando

2° 12.30-15.00
Annotation campaigns, practical work with BRAT and CLARIN-PL tools
with breaks 13.05-13.15, 13.50-14.00
Ranka Stanković, Francesca Frontini, Maciej Piasecki

23 March (Tuesday)
3° 9.00-11.30
"Translating" the results into TEI annotation; annotating NER in TEI
with a break 9.45-9.55
Ioana Galleron, Carmen Brando

4° 12.30-15.00
Analysing NER annotation for literary characters and place names
with a break 13.30-13.40
Ioana Galleron, Carmen Brando

Biographical notes 

Carmen Brando is a senior research engineer in Digital Humanities. She holds a PhD in Computer Science, and her research interests concern the development and use of natural language processing methods for the humanities and the social sciences. At the moment, her work deals specifically with the annotation of named entities in literary and historical texts, as well as automated digitisation and information extraction from modern historical sources.

Francesca Frontini is a research scientist at the Institute for Computational Linguistics in Pisa (ILC-CNR). Her research interests lie in Language Resources, Named Entity Recognition and textual analysis; in particular, she has worked on NLP methods for the analysis of literary texts and literary criticism. In addition, she has published extensively on issues relating to language resource documentation, preservation and standardisation. Since January 2021 she has been a member of the Board of Directors of the CLARIN ERIC research infrastructure.

Ioana Galleron is a professor of French Literature and Digital Humanities at Sorbonne-Nouvelle University. She specializes in analysing French theatre of the 17th and 18th centuries with digital tools. In recent years, she has worked on the conceptualisation of theatrical characters in a digital paradigm. She is currently involved in various projects testing NLP tools for the recognition of literary characters.

Maciej Piasecki is a professor at the Wrocław University of Science and Technology, affiliated with the Department of Computational Intelligence, and works in the areas of Natural Language Engineering, Language Technology and Computational Linguistics. He is also the national coordinator of the CLARIN-PL (http://clarin-pl.eu) research infrastructure – the Polish part of CLARIN ERIC (http://clarin.eu), the language technology research infrastructure for Social Sciences and Humanities. He leads a large research group which has developed many fundamental resources and tools for Polish, as well as research applications of language technology in SS&H.

Diana Santos organized three NER evaluation campaigns for Portuguese, called HAREM, in 2007-2009, and has taught about NE in several venues, in Portugal and at ESSLLI. She is currently professor of Portuguese language and of Statistics for Humanities at the University of Oslo.

Ranka Stanković is an associate professor at the University of Belgrade. Her fields of research are NLP, the semantic web, lexical resources, geoinformation management and deep learning. She is the Head of Computer Centre and Chair for Applied Mathematics and Informatics, and Vice-president of the Language Resources and Technologies Society (JERTEH). She has published more than 100 research papers, developed several practical NLP systems and tools, and participated in the development of lexical and linguistic resources.

Wiktor Walentynowicz is a research assistant at the Wrocław University of Science and Technology, in the field of NLP. His areas of interest are morphological analysis, lemmatization, and normalization of noisy texts. He is currently involved in various projects creating NLP tools for Slavic languages.

Tomasz Walkowiak is an assistant professor at the Department of Computer Engineering, Faculty of Electronics, Wrocław University of Science and Technology, Poland. He is the author or co-author of over 220 scientific papers. His research interests include pattern recognition, NLP, text mining and open set classification. He is involved in the CLARIN-PL project, which facilitates work with very large collections of natural language texts. He is a designer and developer of a set of NLP and text mining tools available at https://ws.clarin-pl.eu.

Workshop 2: Data Analysis, Representation of the Geo-Entities and Enrichment of the Data Using Wikipedia and Google Maps API

24 March 2021, 15:00-17:30h CET

Coordination: Antonija Primorac

Trainer: Benedikt Perak

Tasks: 1) Data analysis. Using the Google Colab platform and Python scripts, we will load the geo-tagged data and convert it to a pandas DataFrame object. This format is useful for creating simple exploratory statistics. For instance, we can calculate the proportion of the geo-tagged data per language, per book, per period, etc.
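As an illustration of this kind of exploratory statistic, here is a minimal sketch. It assumes a hypothetical table of entity mentions with the columns language, book, entity and label; the rows are invented toy data, not ELTeC data, and the real workshop data may be structured differently.

```python
import pandas as pd

# Hypothetical toy data: one row per annotated entity mention.
records = [
    {"language": "eng", "book": "ENG0001", "entity": "London", "label": "LOC"},
    {"language": "eng", "book": "ENG0001", "entity": "Emma", "label": "PER"},
    {"language": "eng", "book": "ENG0002", "entity": "Bath", "label": "LOC"},
    {"language": "eng", "book": "ENG0002", "entity": "Anne", "label": "PER"},
    {"language": "fra", "book": "FRA0001", "entity": "Paris", "label": "LOC"},
    {"language": "fra", "book": "FRA0001", "entity": "Rouen", "label": "LOC"},
    {"language": "fra", "book": "FRA0001", "entity": "Lyon", "label": "LOC"},
    {"language": "fra", "book": "FRA0001", "entity": "Julien", "label": "PER"},
]
df = pd.DataFrame(records)

# Mark the geo-tagged mentions (assumed here to carry the label "LOC").
df["is_geo"] = df["label"].eq("LOC")

# The mean of a boolean column is the proportion of True values.
geo_share_by_language = df.groupby("language")["is_geo"].mean()
geo_share_by_book = df.groupby(["language", "book"])["is_geo"].mean()

print(geo_share_by_language)
print(geo_share_by_book)
```

The same groupby pattern extends to any other grouping column, such as a period column, if the data provides one.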

2) Representation of the geo-entities.  

2.1.  Getting the Geo-coordinates. In order to represent the geo-names on the map, we need to find appropriate geo-data: longitude and latitude of the place, i.e. geocoordinates. We will explore two methods: 

a) Using the Google Places API (https://developers.google.com/places/web-service/overview) to find the geo-coordinates of the geo-name. The advantage of this approach is access to the extensive data behind the Google Places API, which returns information about a variety of categories, places, establishments, prominent points of interest, and geographic locations. You can search for places either by proximity or by a text string. A Place Search returns a list of places along with summary information about each place; additional information is available via a Place Details query.

The downside of this approach is the need to open an account for this type of query system, although it is free for 0–100,000 place requests per month.
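A text-string search of this kind could be sketched as follows. This is an assumption-laden illustration, not the workshop's own script: the endpoint and parameter names reflect the "Find Place from Text" request of the Places API web service as we understand it, YOUR_API_KEY is a placeholder, and sample_response is a hand-written stand-in for a real reply, so the code runs without network access or an account.

```python
from urllib.parse import urlencode

# Assumed endpoint of the "Find Place from Text" request.
FIND_PLACE_URL = "https://maps.googleapis.com/maps/api/place/findplacefromtext/json"


def build_find_place_url(place_name, api_key):
    """Build the request URL for a text-string place search."""
    params = {
        "input": place_name,
        "inputtype": "textquery",
        "fields": "name,geometry",  # ask only for the name and coordinates
        "key": api_key,
    }
    return f"{FIND_PLACE_URL}?{urlencode(params)}"


def extract_coordinates(response):
    """Return (latitude, longitude) of the first candidate, or None."""
    candidates = response.get("candidates", [])
    if not candidates:
        return None
    location = candidates[0]["geometry"]["location"]
    return location["lat"], location["lng"]


# Hand-written illustrative response, not real API output.
sample_response = {
    "status": "OK",
    "candidates": [
        {"name": "Rijeka", "geometry": {"location": {"lat": 45.3271, "lng": 14.4422}}}
    ],
}

url = build_find_place_url("Rijeka", "YOUR_API_KEY")
print(url)
print(extract_coordinates(sample_response))
```

In practice the URL would be fetched with an HTTP client and the decoded JSON passed to extract_coordinates.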

b) Using the Wikipedia Python package to explore Wikipedia data on geolocated entities. The wikipedia package is a Python library that makes it easy to access and parse data from Wikipedia. We will explore the option to find Wikipedia entries by geo-names and geo-coordinates, as well as to extract additional data.
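A lookup along these lines might be sketched as below, assuming the third-party wikipedia package from PyPI (pip install wikipedia). The helper lookup_place and its wiki parameter are hypothetical names introduced for illustration; the parameter only exists so that the function can be tried out with a stand-in object instead of live network calls.

```python
def lookup_place(name, wiki=None):
    """Look up a geo-name on Wikipedia and return basic geo-data, or None.

    `wiki` defaults to the third-party `wikipedia` module; a stand-in
    object with the same `page` and `geosearch` functions can be passed.
    """
    if wiki is None:
        import wikipedia as wiki  # imported lazily; requires network access

    page = wiki.page(name, auto_suggest=False)
    try:
        # `coordinates` is a (latitude, longitude) pair of Decimal values.
        lat, lon = page.coordinates
    except (KeyError, TypeError):
        # Raised when the article carries no coordinates.
        return None

    lat, lon = float(lat), float(lon)
    return {
        "title": page.title,
        "url": page.url,
        "lat": lat,
        "lon": lon,
        "summary": page.summary[:200],
        # Titles of articles located within 1000 m of the coordinates.
        "nearby": wiki.geosearch(lat, lon, results=5, radius=1000),
    }
```

With the package installed and a network connection, lookup_place("Rijeka") would query Wikipedia directly; ambiguous names may raise the package's disambiguation error, which this sketch does not handle.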

2.2. Mapping geo-names as markers. We will use the Folium package to represent the data as markers with a tooltip and an HTML popup on the map. This representation will help literary scholars further the discussion about the location of the narratives and the interpretation of literary texts.

Biographical note 

Benedikt Perak is an assistant professor at the Faculty of Humanities and Social Sciences, University of Rijeka, where he has been teaching courses in the fields of linguistics, digital humanities and data science at the undergraduate and graduate level.

His central research interest is the implementation and development of methods of digital humanities, natural language processing and data science in the field of social interaction, and of platforms for the development of digital assistants and advanced communication based on machine learning technologies and artificial intelligence. The list of papers can be found at https://www.bib.irb.hr/pregled/znanstvenici/324151?autor=324151

Benedikt is the founder and head of the Laboratory for Research of Cultural Complexity at the Department of Cultural Studies (https://cultstud.ffri.hr/?p=541). 

Workshop 3: Relating the Data

25 March 2021, 15:00-17:30h CET. Cancelled for the day; the workshop will be continued as an asynchronous event later on (participants will be notified).

Coordination: Antonija Primorac

Trainer: Jessie Labov

Following from the work done in Workshop 2, and using a list of locations associated with different texts in the corpus, we will explore ways to enrich and relate this data to knowledge bases and build a larger context around it. Using a variety of sources (Wikidata, Google Books, VIAF and national library subject files), the goal is to enrich each dataset using the wider world of linked open data.

Once we have the basic info for each text (birthplace of author, residence at time of writing, place of publication), we can begin to relate the ELTeC geodata to other time/space coordinates, and think about more detailed mapping visualizations that could begin to illustrate the terrain of our initial research questions. 

Biographical note

Jessie Labov is a Resident Fellow in the Center for Media, Data and Society and a Researcher in the Literary Theory Department of the Institute of Literary Studies at the Eötvös Loránd Research Network. At CEU, she worked as a member of the Digital Humanities Initiative, and the Text Analysis Across Disciplines Initiative.  As part of her most recent book project, Transatlantic Central Europe: Contesting Geography and Redefining Culture Beyond the Nation (CEU Press 2019), she carried out a mezzo-scale mapping project on the Polish émigré journal Kultura. She is currently directing the 2nd year of the CEU Summer University Course Cultures of Dissent in Eastern Europe (1945-1989): Research Approaches in the Digital Humanities – Online.

Materials:

On GitHub.