Workshop on Computational Text Analysis

Content


Contemporary social science operates in an era of big data. Political actors regularly justify their decisions on various communication channels, institutions publish policy reports, and individuals state their opinions on social media and in the comment sections of news outlets. But how can we make use of these data?

This workshop helps researchers (1) gather textual data from publicly accessible webpages, (2) prepare the raw material for analysis, (3) acquire techniques to analyse the data and (4) understand recent trends in text- and images-as-data. The workshop is structured around four input sessions and 2-3 practical sessions.

You can download the syllabus here.

People


Instructors: Mirko Wegemann (he/him), Dr. Eva Krejcova (she/her)
Teaching Assistant: Sara Dybesland (she/her)

Schedule

Input session                     Lab session
24/03/2025, 10:00-12:00 (SR 2)    24/03/2025, 13:00-15:00 (SR 2)
25/03/2025, 09:00-12:00 (SR 2)    25/03/2025, 13:00-15:00 (SR 2)
26/03/2025, 09:00-12:00 (SR 2)    26/03/2025, 13:00-15:00 (SR 2)
27/03/2025, 09:00-12:00 (SR 2)    27/03/2025, 13:00-15:00 (SR 2)

Materials

Please download the files, put them in one directory and create a .Rproj in that directory.
To download the MARPOR data on your own, you can use this script. You first need to register for API access with the Manifesto Project. The API key must be stored in a .txt file in your project directory.
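
As a rough illustration of the key-file convention described above, the following sketch reads a key from a plain-text file and strips stray whitespace. It is written in Python purely for illustration; the workshop's own download script may handle this differently. The function name read_api_key and the file name in the usage note are hypothetical.

```python
from pathlib import Path


def read_api_key(path):
    """Return the API key stored in a plain-text file.

    Assumes the file contains only the key; surrounding whitespace
    and a trailing newline are stripped.
    """
    key = Path(path).read_text(encoding="utf-8").strip()
    if not key:
        # Fail early instead of sending an empty key to the API.
        raise ValueError(f"No API key found in {path}")
    return key
```

With a file called, say, manifesto_apikey.txt (a made-up name) in the project directory, read_api_key("manifesto_apikey.txt") would return the key as a string. Keeping the key in a separate file also makes it easy to exclude it from shared or version-controlled material.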

Session 1: Scraping

Slides Input session Lab session
Slides
Replication code
Exercises
Solution
Script on APIs

Session 2: Bags-of-words

Slides Input session Lab session
Slides
Replication code
Data
Basic STM
STM with covariates
Results (searchK)
Exercises
EUI Theses (Data)

Session 3: Embeddings and machine learning

For Session 3, you need a local installation of Python as well as input embeddings. We will use the Numberbatch ensemble embeddings. If you want, you can also download GloVe embeddings for comparison.
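
To give a sense of what comparing embeddings involves, here is a minimal sketch that parses vectors from a plain-text format (one word per line, followed by its values) and computes a cosine similarity. It assumes the downloaded file may start with a word2vec-style header line giving the number of words and dimensions, which is skipped if present. The function names and the toy vectors are invented for illustration and are not part of the workshop materials.

```python
import math


def parse_embeddings(lines):
    """Parse 'word v1 v2 ...' lines into a dict mapping word -> list of floats."""
    vectors = {}
    for i, line in enumerate(lines):
        parts = line.rstrip().split(" ")
        # Assumption: a two-field first line is a header ("<n_words> <n_dims>"),
        # not a one-dimensional embedding.
        if i == 0 and len(parts) == 2:
            continue
        if len(parts) < 2:
            continue  # skip empty or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy data with made-up values, standing in for a real embeddings file.
toy = ["3 2", "king 1.0 0.0", "queen 0.8 0.6", "apple 0.0 1.0"]
emb = parse_embeddings(toy)
print(cosine(emb["king"], emb["queen"]))
```

For a real file, the same function can be fed an open file handle instead of the toy list, for example open(path, encoding="utf-8") with path pointing to the unpacked embeddings. Running the same word pairs through both embedding sets is one simple way to see where they differ.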
Slides Input session Lab session
Slides
For R:
Replication code
Data
Embeddings Matrix
LLM in R
For Python:
Transformers (Colab): download the raw file here and open it in Colab
Training data
Test data
Script (keyATM)
UK Speech Corpus

Credits

Many thanks to Theresa Gessler for her course materials on CTA, which can be accessed via this link, and to Moritz Laurer for his course on Transformer Models at COMPTEXT.