Workshop on Computational Text Analysis


Content


Contemporary social science operates in an era of big data. Political actors regularly justify their decisions across various communication channels, institutions publish policy reports, and individuals state their opinions on social media and in the comment sections of news outlets. But how can researchers make use of these data?

This workshop helps researchers (1) gather textual data from publicly accessible webpages, (2) prepare the raw material for analysis, (3) learn techniques to analyse the data and (4) understand recent trends in text- and images-as-data. It is structured around four input sessions and 2-3 practical sessions.

You can download the syllabus here.

People


Instructors: Mirko Wegemann (he/him), Dr. Eva Krejcova (she/her)
Teaching Assistant: Sara Dybesland (she/her)

Schedule

Input session | Lab session
30/05/2024, 09:00-11:00 (SR 2) | 30/05/2024, 13:00-15:00 (SR 2)
31/05/2024, 10:00-12:00 (SR 2) | 31/05/2024, 13:00-15:00 (SR 2)
03/06/2024, 10:00-12:00 (SR 2) | 03/06/2024, 13:00-15:00 (SR 2)
04/06/2024, 10:00-13:00 (SR 2) | No lab session (but a longer input session)

Materials

Please download the files, put them in one directory and create an R project file (.Rproj) in that directory.
To download the MARPOR data on your own, you can use this script. You first need to register for API access at the Manifesto Project. The API key must be stored in a .txt file in your directory.
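
As an illustration of the API key step above, here is a minimal Python sketch of reading a key from a plain-text file. It is independent of the linked download script; the function name `read_api_key` and the file name used below are hypothetical and chosen only for this example.

```python
from pathlib import Path


def read_api_key(path: str) -> str:
    """Return the API key stored in a plain-text file.

    Leading and trailing whitespace (including a final newline)
    is removed, since editors often add one when saving.
    """
    key = Path(path).read_text(encoding="utf-8").strip()
    if not key:
        raise ValueError(f"No API key found in {path}")
    return key
```

Assuming the key was saved as `manifesto_apikey.txt` (a hypothetical name) in the project directory, it could be loaded with `api_key = read_api_key("manifesto_apikey.txt")`. Keeping the key in a separate file means it does not have to appear in scripts that are shared with others.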

Session 1: Scraping

Slides | Input session | Lab session
Slides | Replication code | Exercises, Solution

Session 2: Bags-of-words

Slides | Input session | Lab session
Slides | Replication code, Data, Basic STM, STM with covariates, Results (searchK) | Exercises, EUI Theses (Data)

Session 3: Embeddings and machine learning

For Session 3, you need a local installation of Python and the GloVe embeddings, which you can download here.
Slides | Input session | Lab session
Slides | For R: Replication code, Data, Embeddings Matrix, Addition: How to use GPT in R | Script (keyATM), UK Speech Corpus
 | For Python: Transformers (Colab; download the raw file here and open it in Colab), Training data, Test data |
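
For orientation on the GloVe embeddings required for Session 3: the pre-trained vectors come as plain-text files in which each line holds a word followed by its space-separated vector values. The sketch below is not taken from the workshop's replication code. It only illustrates how such a file could be parsed in plain Python and how two word vectors can be compared with cosine similarity. The function names and the toy vectors are made up for this example.

```python
import math


def load_glove(lines):
    """Parse GloVe text-format lines into a dict mapping word -> list of floats."""
    embeddings = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        embeddings[parts[0]] = [float(x) for x in parts[1:]]
    return embeddings


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy two-dimensional vectors, invented for illustration (not real GloVe values).
sample = ["party 1.0 0.0", "government 1.0 1.0", "banana 0.0 1.0"]
vectors = load_glove(sample)
similarity = cosine_similarity(vectors["party"], vectors["government"])
```

With a downloaded file, the same function can be applied to an open file handle, for example `with open("glove.6B.100d.txt", encoding="utf-8") as f: vectors = load_glove(f)`. Which GloVe file the workshop uses is an assumption here. For the larger files, a NumPy-based loader will be considerably more memory-efficient than Python lists.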

Credits

A big thanks to Theresa Gessler for her course materials on CTA, which can be accessed via this link, and to Moritz Laurer for his course on Transformer Models at COMPTEXT.