When working with real data, we often have to deal with data that aren’t 100% properly formatted. Problems can include numbers with delimiting commas between thousands, date fields with different separators, date fields with both month-year and year-month observations, or even numeric fields with one inexplicable string observation. We could fix these problems manually, either by editing a .csv file in Excel, or by using subsetting and which() in R, but this is not the ideal solution. We might miss mistakes, we might make new mistakes while trying to fix existing ones, and this process can be very time consuming if there are a lot of observations to check.

## Regular Expressions

A better way of fixing these problems is to use functions that can automatically handle these errors. You can think of these functions as an automated version of a find-and-replace. To get an idea of how useful these functions can be, load the file Peace Agreements.RData and take a look at the UCDP Peace Agreements Data (PA). Which variable(s) might cause problems if we were to try and feed them to a function?

load('Peace Agreements.RData')

The Duration variable is a mess. It has observations in month-day-year format, year-month format, year-month-day format, and some that are just years. On top of this, some observations use slashes while others use dashes. Some of the year-month observations end in trailing dashes. If we tried to pass this variable to the as.Date() function, we’d get all kinds of errors. Step one to fixing this is to make sure we’re using the same separator for all observations. Look up the gsub() function, and use it to make sure that all observations in Duration use the same date separator.

PA$Duration <- gsub(pattern = '/', replacement = '-', x = PA$Duration)
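To see what gsub() does in isolation, here is a minimal sketch on a made-up vector of date strings. The values in `toy_dates` are invented for illustration and are not taken from the Peace Agreements data.

```r
# toy vector mixing slash and dash separators (hypothetical values)
toy_dates <- c('7/2/1964', '1979-06', '1995/11/21', '2003')

# gsub() replaces every match within each element
fixed_all <- gsub(pattern = '/', replacement = '-', x = toy_dates)
fixed_all

# sub() replaces only the first match within each element
fixed_first <- sub(pattern = '/', replacement = '-', x = toy_dates)
fixed_first
```

Comparing the two results shows why gsub() rather than sub() is the right choice when a date contains more than one separator.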

Functions like gsub() are actually wrappers around regular expressions, or regex. Regular expressions are a general concept in computer science that let you identify and locate every instance within a piece of text that matches a pattern string that you supply. Notice that when you looked up gsub(), the help file included descriptions of multiple functions. There are many different operators in regex, and when combined, they can be incredibly powerful, but they can also be very confusing. Luckily, there are online tools you can use which visualize what a regex is actually doing. These are incredibly handy for testing a regex when you’re trying to solve a text problem. This tool is great because it highlights each part of your regex and tells you exactly what it is doing. Unfortunately, it’s written for *nix regex, so you can’t just copy and paste your regex into R. Luckily, someone has written a version for R-flavored regex; it’s not quite as full featured, but you can use your regex directly in R.

The basic syntax of a regex is that you include a string that you wish to find, and an operator telling the regex where in the string to look for it. Once you’ve constructed the regex, you have to give it some text to evaluate. The simplest operators, and the ones you’ll likely use the most, are ^, which looks for the pattern at the beginning of the string, and $, which looks for the pattern at the end of the string. The ^ goes before the leading character(s) you are looking for, and the $ goes after the trailing character(s) you are looking for. If you want to search for a range of characters, say ‘a’ through ‘g’, or 1 through 3, you can use square brackets around the characters to search the whole range, e.g. ^[a-g] will match any strings that begin with the letters ‘a’ through ‘g’, while [127-9]$ will match any strings ending in 1, 2, 7, 8, or 9.

Use regex to find all of the observations on Duration that end with a trailing dash.

grep(pattern = '-$', x = PA$Duration)
## [1] 41 42 46 49 50 51 52 53 54 64 80 81 87 90 94 96 158
## [18] 210 211

The basic grep() function returns a vector indicating the elements in x that match the expression. We can also use grepl(), which returns a logical vector the length of x indicating whether each element in x matches the expression or not.

grepl(pattern = '-$', x = PA$Duration)
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE
## [45] FALSE TRUE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE FALSE
## [56] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
## [67] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [78] FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
## [89] FALSE TRUE FALSE FALSE FALSE TRUE FALSE TRUE FALSE FALSE FALSE
## [100] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [111] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [122] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [133] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [144] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [155] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [166] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [177] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [188] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [199] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [210] TRUE TRUE FALSE FALSE FALSE FALSE FALSE

However, regex match text one character at a time. Consequently, when the pattern you are searching for is 11, regex doesn’t know that this means eleven. Instead, it looks for two ones in a row. This means that we cannot use the syntax [10-19] to search for values between 10 and 19. Using this pattern will instead return matches for 1, 0-1, and 9, i.e. 0, 1, and 9. To get around this, we can put the leading digit outside the brackets, as in 1[0-9]$, which will return any strings ending in 10 through 19.

Write a regular expression that finds every value of pa_date in the 70s and 80s. We can use regex to match arbitrarily complicated conditions; write an expression that matches every value of pa_date in the final two years of each decade in the 80s and 90s.

grep(pattern = '19[7-8]', x = PA$pa_date)
## [1] 1 2 6 7 21 34 55 56 63 71 79 111 112 113 114 115 116
## [18] 117 118 119 120 121 122 125 177 178 179 191 192 209
grep(pattern = '19[8-9][8-9]', x = PA$pa_date)
##  [1]   4   5   7  21  46 108 109 125 129 157 158 169 174 186 187 192 209
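The anchors and character classes described above are easier to see on a small input. The following sketch runs each pattern against a made-up vector; `toy` and its values are hypothetical and chosen only to exercise each case.

```r
# toy strings (hypothetical) to illustrate anchors and character classes
toy <- c('apple', 'grape', 'hat', '12', '7', '119', '1985')

starts_a_to_g <- grepl('^[a-g]', toy)    # begins with a letter from a through g
ends_127to9   <- grepl('[127-9]$', toy)  # ends in 1, 2, 7, 8, or 9
ends_10_to_19 <- grepl('1[0-9]$', toy)   # ends in 10 through 19
ends_0_1_or_9 <- grepl('[10-19]$', toy)  # equivalent to [019]$: ends in 0, 1, or 9

# line the results up against the inputs for inspection
data.frame(toy, starts_a_to_g, ends_127to9, ends_10_to_19, ends_0_1_or_9)
```

Note how '12' matches 1[0-9]$ but not [10-19]$, which is exactly the one-character-at-a-time behavior discussed above.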

We can also use the OR operator, which is defined by | in regex. This allows us to look for strings which match one of two (or three or n) different conditions. For example, if we wanted to find all observations in a vector of words that begin or end in vowels, we would use ^[aeiou]|[aeiou]$. Write a regex that looks for Angola at the beginning of Name or Northern Ireland at the end. Note that regex are case-sensitive by default, so we can either capitalize those names, or see if there’s a way to ignore case when matching…

PA[grep(pattern = '^Angola|Northern Ireland$', x = PA$Name), ]

If we want to get the actual strings that match our expression, we have two options. The first is to set value = T in our call to grep(). Use this argument to get the dates of all peace agreements signed in 2010 or later. If we know that there are no letters in our dates, we can take a shortcut and use the . operator, which will make the expression look for any character. If we’re less certain of how well formatted our dates are, however, we would want to use [0-9] to ensure we only match elements with a number in the final year digit.

grep(pattern = '201.$', x = PA$pa_date, value = T)
## [1] "2/23/2010" "6/28/2011" "6/6/2010"

This is handy, but what if we want to return the entire observation for which the expression is true, i.e. the rows in the dataframe where x matches the expression? Luckily, we don’t have to look beyond the base behavior of grep() to do this. By default, grep() returns a numeric vector (or scalar, if there’s only one match) that lists which elements of x match the expression. How can we use this to capture whole rows in a dataframe? Try to get every observation in PA with a Duration value in the 70s.

PA[grep(pattern = '197.', x = PA$Duration), ]
PA[grepl(pattern = '197.', x = PA$Duration), ]

Notice that both grep() and grepl() return the exact same subset of the dataframe! This is because we can use both logical and numeric vectors to index elements of objects in R. Use regex functions and your knowledge of subsetting to limit the peace agreements dataframe to agreements which were signed in February, March, April, June, November, or December (pa_date is the signing date).

PA[grep('^[2-46]|^1[1-2]', PA$pa_date), ]
PA[grepl('^[2-46]|^1[1-2]', PA$pa_date), ]

We can also combine different regex R functions to carry out multiple operations in one line. We can use the various flavors of grep() to match patterns, and we can use the various forms of sub() to find and replace them. Our leading or trailing vowel match example above has a problem. The Name variable has conflict names, meaning that it includes the name of a country and the name of the second combatant. What if we just want to know which of the first listed countries’ names begin or end with a vowel? Use regex functions to isolate the first country name in Name, identify which country names begin or end in a vowel, and then return a dataframe of all of these observations. Because country names start with a capital letter, the match needs to ignore case.

PA[grep('^[aeiou]|[aeiou]$', sub(":.*", "", PA$Name), ignore.case = T), ]

We can also combine different regex functions when generating identifier variables. Download and install the countrycode package, which is used to convert between a number of international country ID code systems. We need to limit the Name variable to just the country name before the colon, and remove any old country names in parentheses. Once we’ve done this, we can feed the resulting vector of country names to countrycode with arguments origin = 'country.name' and destination = 'cown' to convert country names to numeric COW codes. See if you can do this in one line of code.

library(countrycode)
PA$COWcode <- countrycode(sub("-.*", "", sub(":.*", "", PA$Name)), 'country.name', 'cown', warn = F)

## Dates

We’ve been working with dates thus far because dates are often stored in datasets as strings. This is done so that they are human readable, and we can easily glance at a dataset and be able to tell when each observation occurred. Unfortunately for us, if we want to do any kind of time-series or duration analysis, a computer has no idea what July 2, 1964 or 7/2/1964 mean. R stores date-times in POSIX format, which is defined as the number of seconds since midnight UTC, January 1, 1970; objects of class Date instead count whole days since January 1, 1970.
Luckily, there are functions which can convert between human readable dates and POSIX time, but human readable dates are strings, which means they can be prone to many of the problems illustrated above. We have a few options here. The as.Date() function in base R works when our dates are perfectly formatted. The pa_date variable is, thanks to our efforts above, correctly formatted, so try creating a date object out of it.

pa_dates <- as.Date(PA$pa_date, format = '%m/%d/%Y')
class(pa_dates)
## [1] "Date"
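The format argument is what tells as.Date() how to read each string. As a small sketch with made-up inputs (none of these strings come from PA), the following shows one format that matches, one input that needs no format, and one mismatch:

```r
# the format string must describe the layout of the input strings
d_mdy <- as.Date('7/2/1964', format = '%m/%d/%Y')  # month/day/four-digit year
d_iso <- as.Date('1964-07-02')                     # ISO year-month-day needs no format
d_bad <- as.Date('1964-07', format = '%m/%d/%Y')   # layout mismatch yields NA

d_mdy == d_iso  # both strings describe the same day
is.na(d_bad)    # the mismatch fails quietly rather than raising an error
```

The quiet NA in the last case is worth keeping in mind: a wrong format will not necessarily stop your script.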

Now that we’ve converted these string fields to date objects, we can do many standard math operations on them.

pa_dates[3] - pa_dates[6]
## Time difference of 2033 days
pa_dates[72] > pa_dates[200]
## [1] FALSE
mean(pa_dates[22:94])
## [1] "1996-08-05"

This is because dates in R are actually numeric variables. We can see this by using as.numeric() on a date object. Remember, for objects of class Date each of these numbers is the number of days since January 1, 1970 (POSIXct date-times are the ones that count seconds since midnight UTC on that day).

as.numeric(pa_dates[4:16])
##  [1] 10745 10326  5829  6728 12045 13635 13693 13928 13931 13932 13937
## [12] 13938  9359
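A short sketch makes the units concrete. It uses only base R; the specific dates are chosen for illustration.

```r
# a Date one day after the epoch is stored as the number 1
d <- as.Date('1970-01-02')
as.numeric(d)

# the same instant as a POSIXct date-time is stored in seconds
secs <- as.numeric(as.POSIXct('1970-01-02', tz = 'UTC'))
secs

# a day count can be mapped back to a calendar date by supplying the origin
back <- as.Date(10745, origin = '1970-01-01')
back
```

The first value printed in the output above, 10745, is therefore a count of days, which lands in 1999 rather than a few hours into 1970.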

The Duration variable is a little less well formatted. Try the same approach on it. We either get an error, or a vector of NAs, depending on whether we specify a format as an argument or not. We could manually rewrite the offending entries, but this would take a lot of time and risks transcription errors. The lubridate package has a number of date and time related functions that vastly improve on the functionality of base R in this regard. Download and install lubridate, and see if there is a function in the package that can successfully convert Duration into a new date object.

library(lubridate)
durations <- as.Date(parse_date_time(PA$Duration, c('mdy', 'ymd', 'ym', 'y')))
## Warning: 1 failed to parse.

This still results in one error, as one date does not parse, even with the much more flexible parse_date_time(). This is a good example of why it is still important to look at your data, even when you’re writing code which will automatically handle many different string issues. Write a line of code that lets you identify the offending observation. Note that you can’t use !is.na() on Duration to rule out the blank entries, because a blank string does not count as NA, i.e. is.na('') returns FALSE. How can we match an empty string?

PA[is.na(durations) & !grepl('^$', PA$Duration), ]
PA[is.na(durations) & grepl('.', PA$Duration), ]

Both of these approaches work because they screen out empty strings. An empty string matches ^$, since its start is immediately followed by its end, so negating that match drops it. Similarly, the . operator represents any single character, and there are no characters in an empty string, so an empty string fails to match it. We can see that observation 179, a civil conflict in Chad, has a Duration of June 31, 1979. June 31 does not exist, so this is the source of our error. We’re going to take the path of least resistance and assume that the coder meant to choose the end of June, so go ahead and change this to June 30.

PA$Duration[PA$Duration == '1979-06-31'] <- '1979-06-30'
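The difference between a missing value and an empty string is easy to check on a toy vector. This is a minimal sketch with invented values, not the actual Duration column:

```r
x <- c('1979-06-30', '', NA, '1985')

na_flag    <- is.na(x)        # only the true NA is flagged; '' is not NA
empty_flag <- grepl('^$', x)  # TRUE only for the empty string
any_char   <- grepl('.', x)   # TRUE for elements with at least one character

data.frame(x, na_flag, empty_flag, any_char)

# an impossible calendar date fails to parse and comes back as NA
impossible <- as.Date('1979-06-31', format = '%Y-%m-%d')
is.na(impossible)
```

Note that grepl() returns FALSE for the NA element under both patterns, so NA values and empty strings need to be handled separately.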

## File Names

Another area where strings can be very important to our code is when working with file names. Often you just have to load a couple .csv files at the start of a script, and that’s all you need to worry about. If you have a dataset with separate files for individual years or districts (or whatever), or if your code produces lots of output that you need to save, then string knowledge can save you lots of time.

Write some code that loads the four .csv files of UN international trade statistics using a loop. Your code should also name the resulting dataframes with the years that they contain. You could accomplish this with trade_data <- lapply(csvs, read.csv), but this would return a list of four data frames. While this may be preferable if you need to iterate over the data frames, this is not always the case. Make sure your code results in four separate dataframes. Since we’re dynamically creating object names, we can’t use our normal <- assignment operator. Instead, take a look at assign(). Use list.files() to get the names of the individual files. Notice that the function has a pattern argument; yep, this is also a regex!

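csvs <- list.files(path = '.', pattern = '[0-9]\\.csv$')
# strip the file extension, then everything before the first 1 or 2 (the start of the year)
extract_year <- function(dataset) sub('.+?(?=1|2)', '', sub('\\..*', '', dataset),
                                      perl = T)
for (i in 1:4) {
  print(paste0('loading dataset ', i, ': ', csvs[i]))
  # the 'trade_' prefix is an assumed naming scheme, not taken from the original loop
  assign(paste0('trade_', extract_year(csvs[i])), read.csv(csvs[i]))
}
## [1] "loading dataset 1: comtrade1962.csv"
## [1] "loading dataset 4: comtrade1965.csv"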
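If you don’t have the trade files to hand, the same pattern can be tried on throwaway data. The sketch below is self-contained: it writes two tiny .csv files to a temporary directory and then loads them with assign(). The file contents, the `trade_` prefix, and the regex used to pull out the year are all assumptions made for illustration.

```r
# write two small CSVs to a temporary directory (toy stand-ins for the trade files)
tmp <- tempfile('trade_demo_')
dir.create(tmp)
for (yr in c(1962, 1963)) {
  write.csv(data.frame(year = yr, value = 1:3),
            file.path(tmp, paste0('comtrade', yr, '.csv')), row.names = FALSE)
}

# find the files, then create one object per file named after its year
files <- list.files(tmp, pattern = '[0-9]{4}\\.csv$')
for (f in files) {
  yr <- sub('^[^0-9]*([0-9]{4})\\.csv$', '\\1', f)  # keep only the four-digit year
  assign(paste0('trade_', yr), read.csv(file.path(tmp, f)))
}

# list the objects the loop created
ls(pattern = '^trade_[0-9]{4}$')
```

Objects created this way can be retrieved by name later with get(), e.g. get(paste0('trade_', 1962)).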

This is a trivial example because we only have four datasets due to their large filesize, but these data extend from ’62 to the present, so if we wanted to load all of them, this approach would be significantly faster than copying and pasting filenames. We can also accomplish this using lapply() instead of a loop to write less code; what type of object will the data from the .csv files be in?

trade <- lapply(csvs, read.csv)

Sometimes we might also want to save multiple datasets if we need to access simulated data later or in other scripts. If our simulated datasets are sufficiently large, we might want to be able to read them into R one at a time, to cut down on memory usage, or even to allow different nodes to access them individually in a distributed computing setup. We might also want to save each sample in a nonparametric bootstrap to make replication easier.

Write a loop that generates 100 random variables each from four different distributions, and then combine them with four fake coefficients you create and a $$\mathcal{N}(0,1)$$ error to generate a response variable through a Gaussian data generating process. Combine all five random variables into a dataframe and repeat this process to generate 100 simulated datasets, saving each data frame into a list. Write a loop that saves each dataframe in the list as a separate .csv file with a filename that lets you identify which simulation it contains (you may want to create a subdirectory for this).

beta <- c(.08, -1.4, .89, -.27) # fake coefficients for DGP
sims <- list()
for (i in 1:100) {
  x <- cbind(rnorm(100, 2, 4), rgamma(100, 1, 2), rt(100, 12), rpois(100, 32))
  y <- x %*% beta + rnorm(100, 0, 1)
  sims[[i]] <- data.frame(y, x)
}

# create directory to save simulated datasets
dir.create('Simulations', showWarnings = F)

# loop to save .csv files
for (i in 1:100) write.csv(sims[[i]], paste('Simulations/sims_', i, '.csv', sep = ''))

Just like reading in files, we can also do this with an apply family function, although this actually takes a little bit more code to pull off.

invisible(mapply(function(x, y) write.csv(x, paste0('Simulations/sims_', y, '.csv')),
x = sims, y = 1:100))
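One thing to watch for when reading these files back in is ordering: filenames are strings, so an alphabetical sort will generally not put sims_10 after sims_2. As a hedged sketch, using made-up filenames in the same style as above, we can either recover the simulation number with a regex or zero-pad it when writing:

```r
# toy filenames in the same style as the simulation files above
fnames <- paste0('sims_', c(1, 2, 10, 100), '.csv')

sort(fnames)  # alphabetical order, which need not match numeric order

# pull out the simulation number with a regex and order numerically
sim_id <- as.numeric(sub('^sims_([0-9]+)\\.csv$', '\\1', fnames))
in_order <- fnames[order(sim_id)]
in_order

# alternative: zero-pad the index when writing so both orders agree
padded <- sprintf('sims_%03d.csv', c(1, 2, 10, 100))
padded
```

Zero-padding is usually the simpler choice if you control how the files are named in the first place.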

## Regex for Text Analysis

Many of the operations that packages like quanteda carry out to prepare a corpus for document feature matrix creation are powered by regular expressions. Stopwords can be easily removed with gsub() with replacement = '', capitals can be removed with the base R function tolower(), and words can be stemmed and tokenized with gsub() as well. Luckily, the dfm() function in quanteda does all of these operations simultaneously. This is a great example of how it’s important to work smarter, not harder. If someone’s written a function that combines multiple different steps you need to carry out, use their code to do it in one line instead of having to write five lines.
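To give a feel for what is happening under the hood, here is a rough base R sketch of those preprocessing steps on a single made-up sentence. The sentence and the three-word stopword list are invented, and this is not how quanteda implements these steps internally; it only illustrates that each one can be expressed with a regex.

```r
doc <- 'The Parliament debated the budget, and the vote was close.'
stops <- c('the', 'and', 'was')  # tiny hypothetical stopword list

clean <- tolower(doc)                             # remove capitals
clean <- gsub('[[:punct:]]', '', clean)           # strip punctuation
stop_rx <- paste0('\\b(', paste(stops, collapse = '|'), ')\\b')
clean <- gsub(stop_rx, '', clean, perl = TRUE)    # drop whole-word stopwords
clean <- gsub(' +', ' ', trimws(clean))           # collapse leftover spaces
tokens <- strsplit(clean, ' ')[[1]]               # naive whitespace tokenizer
tokens
```

The \\b word boundaries matter here: without them, removing 'the' would also cut into words that merely contain those letters.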

As with all statistical analysis, one of the biggest parts of text analysis is data preparation. We’re going to be using a subset of the data from Greene and Cross (2017), which are European Parliament speeches. These data are about as well cleaned and prepared as we could hope for, but we still have some work ahead of us. The data are available here. Warning: don’t unzip the speeches folder in Dropbox (or OneDrive or Google Drive). It contains hundreds of thousands of files, and Dropbox will spend hours trying to index them all. Just copy the 2009 and 2010 folders into Dropbox (or whatever).

# download speeches and metadata
'europarl-data-speeches.zip')

unzip('europarl-data-speeches.zip')
unzip('europarl-metadata.zip')

Often we get text data pre-processed where each document is contained in an individual text file in subdirectories that represent authors, speakers, or some other meaningful grouping. In this case, we have speeches, in days, in months, in years. Take a look at the readtext() function in the package of the same name, and combine it with list.files() to easily load all the speeches from 2009 and 2010 into a dataframe that quanteda can read.

library(readtext) # easily read in text in directories

# recursively get filepaths for speeches from 2009 and 2010
speeches_paths <- list.files(path = c('europarl-data-speeches/2009',
                                      'europarl-data-speeches/2010'),
                             recursive = T, full.names = T)

speeches <- readtext(speeches_paths)

## Creating a Corpus

The corpus() function in quanteda expects a dataframe where each row contains a document ID and the text of the document itself. Luckily, this is exactly what readtext() has produced for us.

library(quanteda)
speeches <- corpus(speeches)

Quanteda supports two kinds of external information that we can attach to each document in a corpus; metadata and document variables. Metadata are not intended to be used in any analyses, and are often more just notes to the researcher containing things like the source URL for a document, or which language it was originally written in. Look at the documentation for the metadoc() function to see how to attach metadata to our speeches.

metadoc(speeches, field = 'type') <- 'European Parliament Speech'

Document variables, on the other hand, are the structure that we want to be able to use in a structural topic model. Read in the speech and speaker metadata files, merge the speaker data onto the speech data, subset the document variables to just 2009 and 2010, and then assign the combined data to our corpus as document variables with the docvars() function.

# read in speech docvars

speeches_dv <- speeches_dv[year(speeches_dv$date) >= 2009 & year(speeches_dv$date) <= 2010, ]

# merge MEP docvars onto speech metadata
dv <- merge(speeches_dv, MEP_dv, all.x = T,
by.x = 'mep_ids', by.y = 'mep_id')

# merge docvars onto corpus
docvars(speeches) <- dv

# inspect first entries
head(docvars(speeches))
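The merge step is worth trying on toy data first, because merge() can change the row order, and document variables need to line up with the documents they describe. The sketch below uses two tiny invented data frames; only the column names `mep_ids` and `mep_id` are borrowed from the code above.

```r
# toy stand-ins for the speech and MEP metadata (all values hypothetical)
speech_meta <- data.frame(speech = c('s1', 's2', 's3'),
                          mep_ids = c(20, 10, 99))
mep_meta <- data.frame(mep_id = c(10, 20),
                       country = c('DE', 'FR'))

# all.x = TRUE keeps every speech, filling unmatched MEP columns with NA
merged <- merge(speech_meta, mep_meta, all.x = TRUE,
                by.x = 'mep_ids', by.y = 'mep_id')
merged

# realign the merged rows to the original speech order
aligned <- merged[match(speech_meta$speech, merged$speech), ]
aligned
```

The same match() idea could be applied to a document ID column before assigning docvars, assuming the metadata carries an ID that corresponds to the document names in the corpus.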

The way that quanteda is written, a corpus is intended to be a static object that other transformations of the text are extracted from. In other words, you won’t be altering the speeches corpus object anymore. Instead, you’ll be creating new objects from it. This approach allows us to conduct analyses with different requirements, such as ones that require stemmed input and those that need punctuation retained, without having to recreate the corpus from scratch each time.

We can also subset corpora based on document level variables using the corpus_subset() function. Create a new corpus object that is a subset of the full speeches corpus only containing speeches made by members of the European People’s Party.

EPP_corp <- corpus_subset(speeches, group_shortname == 'EPP')

We can use the texts() function to access the actual text of each observation (in quanteda version 3 and later, texts() is deprecated in favor of as.character()).

texts(EPP_corp)[5]
## TEXT_CRE_20090112_1-019.txt
## "European Parliament\nHannes Swoboda,\non behalf of the PSE Group.\n(DE)\nMr President, we have, of course, given this matter a great deal of thought. Perhaps Mr Cohn-Bendit overestimates the significance of a resolution, but with the Security Council's resolution we have a basis which we should support and, as the President of Parliament has already said, we should require both sides to seek peace, to lay down their arms and to comply with the Security Council's resolution. I would, however, just like to add that this must be the gist of our resolution. If this is so, we can support it. In this context we would cooperate and in this context we would support Mr Cohn-Bendit's motion.\n"

## Data Exploration and Descriptive Statistics

We can use the KeyWord in Context function kwic() to quickly see the words around a given word for whenever it appears in our corpus. Try finding the context for every instance of a specific word in our corpus. You might want to pick a slightly uncommon word to keep the output manageable.

kwic(speeches, 'hockey', window = 7)

Right away, we can see two very different contexts that hockey appears in: actual ice hockey and the IPCC hockey stick graph. Any model of text as data that’s worthwhile needs to be able to distinguish between these two different uses.

We can also save the summary of a corpus as a dataframe and use the resulting object to plot some simple summary statistics. By default, the summary.corpus() function only returns the first 100 rows. Use the ndoc() function (which is analogous to nrow() or ncol(), but for the number of documents in a corpus) to get around this. Plot the density of each country’s tokens on the same plot, to explore whether MEPs from some countries are more verbose than others (you’ll be able to see better if you limit the x axis to 1500; there are a handful of speeches with way more words that skew the graph).

# extract summary statistics
speeches_df <- summary(speeches, n = ndoc(speeches))

library(ggplot2) # ggplots

# plot density of tokens by country
ggplot(data = speeches_df, aes(x = Tokens, fill = country)) +
  geom_density(alpha = .25, linetype = 0) +
  theme_bw() +
  coord_cartesian(xlim = c(0, 1500)) +
  theme(legend.position = 'right',
        plot.background = element_blank(),
        panel.grid.minor = element_blank(),
        panel.grid.major = element_blank(),
        panel.border = element_blank())