```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
if (!require('tidyverse')) install.packages("tidyverse") # data wrangling
if (!require('openxlsx')) install.packages("openxlsx") # import xlsx files
if (!require('haven')) install.packages("haven") # import stata files, option 1
if (!require('readstata13')) install.packages("readstata13") # import stata files, option 2
if (!require('tesseract')) install.packages("tesseract") # import pdf files
if (!require('pdftools')) install.packages("pdftools") # import pdf files
if (!require('marginaleffects')) install.packages("marginaleffects") # plot regression models
```

# Basics in R

Welcome to the introductory R course. In this Markdown script, we will learn some basic functions to import, process, analyze, and visualize data.
Before we dive into the nitty-gritty, here's a so-called **code chunk**.

```{r}
print("Hello World!")
```

As we can see, the code chunk is highlighted differently from our text. We always open it on a new line with `` ```{r} `` and close it with `` ``` ``. Inside the chunk, we can write any amount of code, which we can execute via the green right-pointing arrow in the top-right corner of the chunk.

## File Import

In a first step, we will load files into our environment.

### Importing .txt Files

We start with a simple .txt file.

```{r}
txt <- readLines("./SnowballStopwordsGerman.txt")
head(txt)
tail(txt)

# The umlauts are not decoded correctly; either convert the encoding with iconv()
txt_conv <- iconv(txt, from = "ISO-8859-1", to = "UTF-8")
tail(txt_conv)

# or convert the file to UTF-8 beforehand (e.g., via Notepad++)
txt2 <- readLines("./SnowballStopwordsGerman_utf8.txt")
head(txt2)
tail(txt2)
```

The file import also demonstrates a common operation in R: we assign an object to a name using "<-" or "=". The function head() prints the first elements of an object (by default the first six), and tail() prints the last ones.
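A quick toy illustration of assignment and these helper functions (the vector here is made up):

```{r}
x <- 1:20        # assign with "<-" (the R convention)
y = x * 2        # "=" also works for assignment

head(x)          # first six elements: 1 2 3 4 5 6
head(x, 3)       # first three: 1 2 3
tail(y, 2)       # last two: 38 40
```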

### Importing RDS Files

Although we work a lot with text, we will typically import other file formats that structure the data in a more organized way. Internally, for example, we often use .RDS to load large datasets that we have previously saved into our environment.

```{r}
df <- readRDS("speeches_german.RDS")
summary(df)
```

In this case, we have loaded a dataset with multiple variables into our environment. In R, we call this format a data.frame. Using the built-in summary() function, we can get a quick overview of the contained variables.

### Importing Excel Files

We will even more frequently work with Excel files or CSV files (comma-separated values). For CSV files, we can use pre-installed functions again.

```{r}
df2 <- read.table("./test_set.csv", sep=",")
summary(df2)
```

As we can see, this function takes an additional argument. By setting sep="," we specify how the individual columns are separated in the .csv file (in this case by a comma).
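As a sketch of what sep controls, we can write a small semicolon-separated file to a temporary path and read it back (the file contents here are invented for illustration):

```{r}
# create a tiny semicolon-separated file in a temporary location
tmp <- tempfile(fileext = ".csv")
writeLines(c("id;value", "1;10", "2;20"), tmp)

# sep = ";" tells read.table() how the columns are delimited
df_demo <- read.table(tmp, sep = ";", header = TRUE)
df_demo
```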

If we look at the dataset more closely, however, we notice that the variable names were not imported correctly (they are in the first row). We can easily fix this error.

```{r}
df2 <- read.table("./test_set.csv", sep=",", header=T)
summary(df2)
```

Excel files (.xlsx) have a more complex structure, and base R cannot import them directly. It is often recommended to convert them to .csv in Excel. If that is not possible, we can use an add-on package such as **openxlsx**.

```{r}
df3 <- read.xlsx("./unemployment_1222.xlsx")
```

### Importing Other File Types

Sometimes files are in other proprietary formats. Here we import, for example, a Stata file into R. We can do this either using the read_dta function from the **haven** package or using **read.dta13** from **readstata13**.

```{r}
# with haven
df4 <- read_dta("./allb18.dta")
table(df4$eastwest)

# with readstata13 
df4 <- read.dta13("./allb18.dta")
table(df4$eastwest)
```

We can also import .pdf files. For scanned documents we need OCR (optical character recognition); the **pdftools** package provides this via the **tesseract** engine.

```{r}
syllabus <- pdf_ocr_text("tada_syllabus.pdf")
syllabus[1]
```

## Data Processing

Often we need to transform data in different ways: rename variables, filter datasets, or generate new variables from existing ones. There are various ways to do this in R, from base R (the functions that ship with the R installation) to add-on packages. In this seminar, we will mostly use dplyr, the data transformation package from the tidyverse. Where instructive, both approaches are shown.

### Displaying Variable Content

Before we transform variables, we often want to get an insight into their structure. There are various ways to do this.

If we want to view the dataset in the RStudio viewer, for example, we can simply use the View() function.

```{r}
View(df4)
```

If we want to look at a variable, it rarely makes sense to print all values. However, we can (as above) display the first cases.

```{r}
head(df4$eastwest)
```

Instead of the dollar sign, we can also put the variable name in square brackets.

```{r}
head(df4[,"eastwest"])
```

Furthermore, we can get a summary() of the variable, for example the variable di01a, which asks about the net income of participants.

```{r}
summary(df4$di01a)
```

Some numbers seem to make little sense. -50 income? If we look at the codebook for the Allbus 2018, we see that -50 means "No income", -41 corresponds to a data error, and -7 means "Refused".

### Creating New Variables

We can handle these cases in different ways. For example, we can define "No income" as 0.

In the following, we introduce the pipe operator for variable modification. The basic logic is that data operations are executed sequentially.
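Before applying it to our data, here is the pipe in miniature (a toy computation; we assume the tidyverse is loaded as in the setup chunk, which provides %>%):

```{r}
library(dplyr)  # provides %>% (also attached via library(tidyverse))

# nested call, read from the inside out:
sqrt(sum(c(1, 4, 9)))

# the same computation with the pipe, read from left to right:
c(1, 4, 9) %>% sum() %>% sqrt()
```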

```{r}
df4 <- df4 %>%
  mutate(di01a = case_when(di01a==-50 ~ 0,
                           TRUE ~ di01a))

summary(df4$di01a)

# with base R
df4$di01a <- ifelse(df4$di01a==-50, 0, df4$di01a)

# alternatively (note: attach()/detach() is generally discouraged, as it can mask objects):
attach(df4)
df4$di01a <- ifelse(di01a==-50, 0, di01a)
detach(df4)
```

That worked. We can't do much with the other values. We can code them as missing values (NA).

```{r}
df4 <- df4 %>%
  mutate(di01a = case_when(di01a < 0 ~ NA_real_, # NA_real_ keeps the numeric type
                           TRUE ~ di01a))

summary(df4$di01a)
```

### Filtering Datasets

Sometimes we are only interested in a certain category of data. Perhaps we want to conduct an analysis that focuses only on East Germans. To do this, we can filter the entire dataset. But beware: we cannot undo this step. So it makes sense to give the new dataset a new name so that we don't overwrite the old one.

```{r}
# with dplyr
df5 <- df4 %>%
  filter(eastwest=="NEUE BUNDESLAENDER")

# with base R
df5 <- subset(df4, eastwest=="NEUE BUNDESLAENDER")
```

We see in the Environment pane (top right in the default RStudio layout) that a new object has been generated. It contains significantly fewer observations, namely only those from East Germany.

### Grouping Data

But let's first go back to our old dataset df4. Now think about the following operation. We want to display the respective average income for East and West Germany. We can achieve this through a combination of the dplyr functions group_by() and summarise().

```{r}
df4 %>%
  group_by(eastwest) %>%
  summarise(average_income = mean(di01a))
```

That didn't work. The reason is that mathematical functions in R return NA as soon as the input contains a missing value. But there is a solution.

```{r}
df4 %>%
  group_by(eastwest) %>%
  summarise(average_income = mean(di01a, na.rm=T))
```

The option na.rm=T ignores all missing values. The average income in West Germany is significantly higher than in the eastern part of the country.
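The effect of na.rm is easy to see on a small invented vector:

```{r}
incomes <- c(1200, 1500, NA, 2000)

mean(incomes)                # NA: the missing value propagates
mean(incomes, na.rm = TRUE)  # 1566.667: computed on the three observed values
```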

Now we can think further: perhaps we want to calculate how far each person deviates from the average income in the part of the country where they live. To do this, we need two steps:
  1. New variable: average income
  2. New variable: difference between average income and own income

```{r}
df4 <- df4 %>%
  group_by(eastwest) %>%
  mutate(average_income = mean(di01a, na.rm=T),
         diff_income = di01a - average_income)

summary(df4$diff_income)
```


### Sorting and Displaying Variables

Sometimes we just want to display a variable rather than save it in a new object. For example, we can display the values of the income variable.

```{r}
df4 %>%
  select(di01a)
```

We can also sort the observations by size (here in descending order).

```{r}
df4 %>%
  arrange(desc(di01a)) %>%
  select(di01a)
```


### Regular Expressions
Before we look at a real example and perform data manipulations with R, let's look at how regular expressions work using the str_view() function. This function is very helpful, especially when we want to test our patterns.

```{r}
words <- c("Parlament", "17. Wahlperiode", "Tagesorientierungspunkt")
str_view(words, "[a-z]")
```

With the above pattern, str_view() highlights each lowercase letter individually; note that [a-z] does not match capital letters. If we want to target whole words, we need a quantifier:

```{r}
words <- c("Deutscher Bundestag", "17. Wahlperiode", "Tagesorientierungspunkt")
str_view(words, "[a-z]+")
```

With ^ at the start of the character class we negate it: the following pattern matches everything that is *not* a lowercase letter.

```{r}
str_view(words, "[^a-z]+")
```

With the logical "or" (|) we can define several alternative patterns.
```{r}
words <- c("Deutscher Bundestag", "17. Wahlperiode", "Tagesorientierungspunkt")
str_view(words, "Bundestag|Wahlperiode")
```


We imported the syllabus earlier and noticed that not all characters came in the way we would like. For example, there are many line breaks, marked as "\n". Regular expressions are very helpful for cleaning such text: they are patterns that describe text in an abstract, generalized form.

For example, we could extract all numbers from the syllabus using the expression \\d.

```{r}
digits <- str_extract_all(syllabus, "\\d")
```

This extracts each digit individually. If we want to extract whole number sequences, we need a quantifier specifying that one or more digits should be matched.

```{r}
digits2 <- str_extract_all(syllabus, "\\d{1,}") # {1,} means "one or more"; equivalent to \\d+
```

We previously found that the text contains many line breaks as "\n".

```{r}
str_count(syllabus, "\n")
```

Usually we don't need these as additional information. We can therefore remove them.

```{r}
syllabus2 <- str_remove_all(syllabus, "\n")
syllabus2[1]
```

When we remove line breaks, we sometimes lose the separation between words. It's better to replace the line breaks with a space.

```{r}
syllabus2 <- str_replace_all(syllabus, "\n", " ")
syllabus2[1]
```

We can wrap parts of a pattern in parentheses to capture them as groups. This allows us, for example, to change the order of words in a string.

```{r}
words <- c("Deutscher Bundestag", "17. Wahlperiode", "Tagesorientierungspunkt")
str_replace(words, "(\\w+) (\\w+)", "\\2 \\1")
```

We also noticed that our document was imported into a character vector with 9 elements. If we want to join the pages back together, we use the following code.

```{r}
syllabus3 <- str_flatten(syllabus2)
syllabus3
```

However, we might also want to split the text, for example, into sentences. We'll learn better ways to do this later, but one option would be the str_split() function.

```{r}
syllabus_sent <- str_split(syllabus3, "\\.")
```


str_split() returns a list, which is awkward to work with. We can convert it into a more usable format, for example a data.frame with one sentence per row.

```{r}
syllabus_sent2 <- unlist(syllabus_sent)             # flatten the list into a character vector

syllabus_sent3 <- data.frame(text = syllabus_sent2) # one sentence per row
```


## Loops
We have now learned some basic operations that can help us prepare data. However, there is one important function we haven't discussed yet: loops.

The idea of loops is simple: we perform an operation sequentially on an object. A somewhat abstract example is counting from 0 to 10.

```{r}
for(i in 0:10){
  print(i)
}
```

However, loops have many other practical uses. For scraping websites, for example, in later sessions we need to open multiple URLs in a browser. A for-loop goes through each URL in a list and opens it in the browser.

```{r}
# define browser configuration on mac (if not on mac, comment out with #)
# options(browser = "/usr/bin/open")

urls <- c(
  "https://www.fdp.de/das-wahlprogramm-der-freien-demokraten-zur-bundestagswahl-2025",
  "https://www.reformparty.uk",
  "https://democrats.org"
)

for (url in urls) {
  browseURL(url)
  Sys.sleep(2) # pause for 2 seconds
}
```


## Functions
At this point, we won't discuss the concept of functions extensively. However, you should see how a function is structured. Let's reuse the simple counting loop from the previous section.

```{r}
for(i in 0:10){
  print(i)
}
```

We can easily convert this into a function.

```{r}
print_no <- function(i){
  return(i) # sapply() collects the returned values into a vector
}

sapply(0:10, print_no)
```

Functions have the advantage that they are more flexible. For example, we can more easily run them in parallel instead of going through each element sequentially. But more on that in the course of the seminar.
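Another advantage is that functions can take additional arguments with defaults, which the bare loop cannot. A small sketch (the function name and its default are our own illustration):

```{r}
# count from a configurable starting point up to n
count_from <- function(n, from = 0) {
  from:n
}

count_from(3)            # 0 1 2 3
count_from(10, from = 5) # 5 6 7 8 9 10
```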


## Visualization

In the course of the seminar, we will often try to visualize our results. In some cases, the packages we use already have built-in functions. Sometimes, however, we also need to create our own graphics. For this purpose, ggplot2 from tidyverse is ideal.

Let's return to household income. For example, we could plot the distribution as a histogram.

```{r}
# base R
hist(df4$di05)

# ggplot
df4 %>%
  ggplot(aes(di05)) +
  geom_histogram() 
```

In the example above, there is no major advantage to ggplot. However, the following example shows how flexible ggplot is.

```{r}
df4 %>%
  ggplot(aes(di05)) +
  geom_histogram() +
  geom_vline(xintercept = mean(df4$di05, na.rm = TRUE)) +
  xlab("Income") +
  theme_light()
```


## Regression Analysis 
Often we are interested in exploring relationships between two (or more) variables based on our data. This can be done in R in different ways.

For example, we can display a simple correlation matrix using cor().

```{r}
cor(df4$di01a, df4$di05, use = "complete.obs") # di01a contains NAs after recoding; use complete cases
```

Not surprisingly: an individual's income is highly correlated with household income (to which the respondent themselves contributes).
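Note that cor(), like mean(), returns NA as soon as either variable contains missing values; the use argument controls how they are handled. A toy illustration with invented vectors:

```{r}
x <- c(1, 2, 3, NA, 5)
y <- c(2, 4, 6, 8, NA)

cor(x, y)                        # NA: missing values propagate
cor(x, y, use = "complete.obs")  # 1: computed only on the complete pairs
```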

We can also run a t-test comparing the means of two variables (by default a Welch two-sample t-test, here of personal against household income).

```{r}
t.test(df4$di01a, df4$di05)
```

Before we briefly address multivariate relationships, let's visualize bivariate relationships with ggplot.

```{r}
df4 %>% 
  ggplot(aes(di01a, di05)) +
  geom_point() + geom_smooth()
```

### Linear Regression

Finally, a brief word on regression analysis in R. Linear models are easily estimated using lm(). However, we should prepare a few variables first.

```{r}
table(df4$educ)      # school-leaving qualification
table(df4$sex)       # sex (dichotomous)
summary(df4$age)     # age
summary(df4$S01)     # years of schooling
summary(df4$fisei88) # father's occupational status
summary(df4$misei88) # mother's occupational status

df4 <- df4 %>%
  mutate(educ = factor(ifelse(educ %in% c("KEINE ANGABE", "ANDERER ABSCHLUSS", "NOCH SCHUELER"),
                              NA, as.character(educ))), # as.character() keeps the labels rather than the internal codes
         age = ifelse(age < 0, NA, age),
         S01 = ifelse(S01 < 0, NA, S01),
         sex = as.factor(sex),
         fisei = ifelse(fisei88 < 0, NA, fisei88),
         misei = ifelse(misei88 < 0, NA, misei88))
```

Now we can estimate a multivariate regression.

```{r}
m1 <- lm(di01a ~ sex + age + eastwest + S01 + fisei + misei, data = df4)
summary(m1)
```

One final extension: in the model above, each effect is estimated while holding the other variables constant. In reality, however, effects often condition one another. For example, education may affect income differently for men and women. Such interaction effects are easy to model.

```{r}
m2 <- lm(di01a ~ sex*S01 + age + eastwest + fisei + misei, data = df4)
summary(m2)
```

We can already see an effect from the regression output. However, it's even better to plot this relationship.

```{r}
preds <- predictions(m2, newdata = datagrid(sex=c("MANN", "FRAU"), S01=c(0,5,10,15,20,25,30)))
preds %>% 
  ggplot(aes(S01, estimate, color=sex)) + 
  geom_point() + geom_line() + theme_light()
```


### Logistic Regression 

R also allows us to conduct analyses of dichotomous dependent variables. The question here is how the probability of Y occurring changes when X changes.

In the following, we try to identify possible determinants of voter turnout.

A brief preparation of the variables.

```{r}
table(df4$pv03)

df4 <- df4 %>%
  mutate(vote = case_when(pv03 == "KEINE ANGABE" ~ NA_real_,
                          pv03 == "NEIN" ~ 0,
                          pv03 == "JA" ~ 1))
table(df4$vote)
```


Now we can estimate the model and display the results.

```{r}
m1_log <- glm(vote ~ sex + age + S01 + eastwest + di01a, data=df4, family="binomial")
summary(m1_log)
```

The more people earn, the more likely they are to vote. The same direction of relationship holds for education. Gender and East/West origin have no effect on voter turnout, but age does. The older people are, the more likely they are to vote.
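Because logistic coefficients are on the log-odds scale, it is common to exponentiate them into odds ratios. A sketch on simulated data (the data-generating values below are invented, not the Allbus results):

```{r}
set.seed(42)
x <- rnorm(500)
p <- plogis(-0.5 + 0.8 * x)          # true model on the probability scale
y <- rbinom(500, size = 1, prob = p)

m_demo <- glm(y ~ x, family = "binomial")
exp(coef(m_demo))  # odds ratios: values > 1 raise the odds of y = 1
```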

```{r}
preds_log <- predictions(m1_log, newdata = datagrid(age=seq(18,100, 5)))
preds_log %>% 
  ggplot(aes(age, estimate)) + 
  geom_point() + geom_line() + theme_light()
```