Workshop on Computational Text Analysis

Content

Contemporary social science operates in an era of big data. Political actors regularly justify their decisions on various communication channels, institutions publish policy reports, and individuals state their opinions on social media and in the comment sections of news outlets. But how can we make use of these data?

This workshop helps researchers (1) gather textual data from publicly accessible webpages, (2) prepare the raw material for analysis, (3) learn techniques to analyse the data and (4) understand recent trends in text- and images-as-data. The workshop is structured around four input sessions and two to three practical sessions.

You can download the syllabus here.

People

Instructors	Mirko Wegemann (he/him)
	Dr. Eva Krejcova (she/her)
Teaching Assistant	Sara Dybesland (she/her)

Schedule

Input session	Lab session
24/03/2025, 10:00-12:00 (SR 2)	24/03/2025, 13:00-15:00 (SR 2)
25/03/2025, 09:00-12:00 (SR 2)	25/03/2025, 13:00-15:00 (SR 2)
26/03/2025, 09:00-12:00 (SR 2)	26/03/2025, 13:00-15:00 (SR 2)
27/03/2025, 09:00-12:00 (SR 2)	27/03/2025, 13:00-15:00 (SR 2)

Materials

Please download the files, put them in one directory and create an .Rproj file in that directory.
To download the MARPOR data on your own, you can use this script. You need to register for API access with the Manifesto Project beforehand. The API key needs to be stored in a .txt file in your directory.
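
As a rough illustration of the last point, the following Python sketch reads an API key from a plain-text file. It is independent of the workshop's own download script; the file name `manifesto_apikey.txt` and the helper `read_api_key` are hypothetical and only serve as an example.

```python
from pathlib import Path


def read_api_key(path):
    """Return the API key stored in a plain-text file, stripped of whitespace."""
    key = Path(path).read_text(encoding="utf-8").strip()
    if not key:
        raise ValueError(f"No API key found in {path}")
    return key


if __name__ == "__main__":
    # "manifesto_apikey.txt" is a hypothetical file name; use your own file's name.
    key_file = Path("manifesto_apikey.txt")
    if key_file.exists():
        api_key = read_api_key(key_file)
        # Print only the length so the key itself never ends up in logs.
        print(f"Loaded an API key of length {len(api_key)}")
    else:
        print(f"{key_file} not found; register for API access and save your key there.")
```

Keeping the key in a separate file rather than in the script means the script can be shared without exposing the key.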

Session 1: Scraping

Slides	Input session	Lab session
Slides	Replication code	Exercises, Solution, Script on APIs

Session 2: Bags-of-words

Slides	Input session	Lab session
Slides	Replication code, Data, Basic STM, STM with covariates, Results (searchK)	Exercises, EUI Theses (Data)

Session 3: Embeddings and machine learning

For session 3, you need a local installation of Python as well as input embeddings. We will use Numberbatch ensemble embeddings. If you want, you can also download GloVe embeddings for comparison.
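
To give a sense of what working with such embeddings involves, here is a minimal Python sketch that parses embeddings stored as plain text (one word per line, followed by its vector values) and compares two words by cosine similarity. It assumes the downloaded file uses this whitespace-separated layout, optionally with a header line giving the number of words and dimensions. The helper names and the toy vectors are made up for illustration and are not taken from Numberbatch or GloVe.

```python
import io
import math


def load_embeddings(lines):
    """Parse 'word v1 v2 ...' lines into a dict mapping words to vectors."""
    vectors = {}
    for i, line in enumerate(lines):
        parts = line.rstrip().split(" ")
        # Some files start with a "<n_words> <n_dims>" header line; skip it.
        if i == 0 and len(parts) == 2 and all(p.isdigit() for p in parts):
            continue
        if len(parts) < 2:
            continue
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy two-dimensional vectors (invented values), standing in for a real file.
toy_file = io.StringIO(
    "3 2\n"
    "tax 1.0 0.0\n"
    "budget 0.8 0.6\n"
    "football 0.0 1.0\n"
)
embeddings = load_embeddings(toy_file)
print(cosine_similarity(embeddings["tax"], embeddings["budget"]))
print(cosine_similarity(embeddings["tax"], embeddings["football"]))
```

For a real file, an open file handle can be passed instead of the toy `StringIO` object, for example `load_embeddings(open(path, encoding="utf-8"))`, where `path` points to the unpacked embeddings file.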

Slides	Input session	Lab session
Slides	For R: Replication code, Data, Embeddings Matrix, LLM in R; for Python: Transformers (Colab; download the raw file here and open it in Colab), Training data, Test data	Script (keyATM), UK Speech Corpus

Credits

A big thanks to Theresa Gessler for her course materials on CTA, which can be accessed via this link, and to Moritz Laurer for his course on Transformer Models at COMPTEXT.