Workshop on Computational Text Analysis
Content
In contemporary social science, we are faced with an era of big data. Political actors regularly justify their decisions on various communication channels, institutions publish policy reports, and individuals state their opinions on social media and in the comment sections of news outlets. But how can we make use of these data?
This workshop helps researchers (1) gather textual data from publicly accessible webpages, (2) prepare the raw material for analysis, (3) acquire techniques to analyse the data and (4) understand recent trends in text- and images-as-data. The workshop is structured around four input sessions and 2-3 practical sessions.
You can download the syllabus here.
People
Instructors | Mirko Wegemann (he/him), Dr. Eva Krejcova (she/her) |
Teaching Assistant | Sara Dybesland (she/her) |
Schedule
Input session | Lab session |
---|---|
24/03/2025, 10:00-12:00 (SR 2) | 24/03/2025, 13:00-15:00 (SR 2) |
25/03/2025, 09:00-12:00 (SR 2) | 25/03/2025, 13:00-15:00 (SR 2) |
26/03/2025, 09:00-12:00 (SR 2) | 26/03/2025, 13:00-15:00 (SR 2) |
27/03/2025, 09:00-12:00 (SR 2) | 27/03/2025, 13:00-15:00 (SR 2) |
Materials
Please download the files, put them in one directory and create a .Rproj in that directory.
To download the MARPOR data on your own, you can use this script. You need to register for API access at the Manifesto Project beforehand. The API key needs to be stored in a .txt file in your directory.
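Keeping the API key in a separate text file means it never has to appear in a script. The workshop's own download script is written in R; the following is only a minimal Python sketch of the same idea, and the function name `read_api_key` and any file name passed to it are hypothetical.

```python
from pathlib import Path


def read_api_key(path):
    """Return the API key stored in a plain-text file, without surrounding whitespace."""
    # Assumption: the file contains nothing but the key, possibly followed by a newline.
    key = Path(path).read_text(encoding="utf-8").strip()
    if not key:
        raise ValueError(f"No API key found in {path}")
    return key
```

With a file called, say, `manifesto_apikey.txt` in the project directory (a hypothetical name), `read_api_key("manifesto_apikey.txt")` would return the key as a string. If the directory is under version control, it is usually sensible to exclude that file.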
Session 1: Scraping
Slides | Input session | Lab session |
---|---|---|
Slides | Replication code | Exercises, Solution, Script on APIs |
Session 2: Bags-of-words
Slides | Input session | Lab session |
---|---|---|
Slides | Replication code, Data, Basic STM, STM with covariates, Results (searchK) | Exercises, EUI Theses (Data) |
Session 3: Embeddings and machine learning
For session 3, you need a local installation of Python and input embeddings. We will use Numberbatch ensemble embeddings. If you want, you can also download GloVe embeddings to compare.
Slides | Input session | Lab session |
---|---|---|
Slides | for R: Replication code, Data, Embeddings Matrix, LLM in R; for Python: Transformers (Colab; download the raw file here and open it in Colab), Training data, Test data | Script (keyATM), UK Speech Corpus |
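To get a feel for the pre-trained embeddings mentioned for session 3 before the lab, the sketch below parses word vectors from plain-text lines and compares two words by cosine similarity. It is not the workshop's replication code. It assumes the usual text layout of one word per line followed by its vector components, and it assumes that a Numberbatch-style file may start with a "count dimensions" header line while a GloVe-style file has none. The function names are hypothetical.

```python
import math


def load_embeddings(lines):
    """Parse word vectors from an iterable of text lines: 'word v1 v2 ... vn'."""
    vectors = {}
    for i, line in enumerate(lines):
        parts = line.rstrip().split(" ")
        # Assumption: word2vec-style text files open with "<n_words> <n_dims>".
        if i == 0 and len(parts) == 2 and all(p.isdigit() for p in parts):
            continue
        if len(parts) < 2:
            continue  # skip empty or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)
```

For a downloaded file, the lines can come from `gzip.open(path, "rt", encoding="utf-8")` for a compressed file or from `open(path, encoding="utf-8")` otherwise. Pure-Python parsing is slow and memory-hungry for full embedding files, so this is meant for inspecting a subset rather than for analysis.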