This workshop will illustrate how to use text mining, and more specifically topic modeling (Blei, 2012; Grün and Hornik, 2011), on a transcribed corpus in order to model linguistic variation and change, and will demonstrate the results of such an analysis on two different data sets. The process involves taking a corpus of texts (e.g. sociolinguistic interviews) and reducing each text to a set of topics, with an associated probability for each topic.
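As a rough illustration of what "reducing each text to a set of topics with associated probabilities" means, the sketch below fits a very small latent Dirichlet allocation model with a collapsed Gibbs sampler, using only the Python standard library. It is not the workshop's own code (which is in R). The function name `fit_lda`, the hyperparameter values, and the three toy "interviews" are all assumptions made up for this example.

```python
import random


def fit_lda(docs, n_topics=2, alpha=0.1, beta=0.01, n_iter=200, seed=0):
    """Toy collapsed Gibbs sampler for LDA (illustrative, not optimized).

    docs is a list of token lists. Returns one list of topic
    probabilities per document.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)

    doc_topic = [[0] * n_topics for _ in docs]              # tokens of doc d in topic k
    topic_word = [[0] * n_words for _ in range(n_topics)]   # tokens of word v in topic k
    topic_total = [0] * n_topics                            # all tokens in topic k
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_d)

    # Repeatedly resample each token's topic given all other assignments.
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta)
                    / (topic_total[t] + n_words * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    # Smoothed per-document topic proportions; each row sums to 1.
    return [
        [(doc_topic[d][t] + alpha) / (len(doc) + n_topics * alpha)
         for t in range(n_topics)]
        for d, doc in enumerate(docs)
    ]


# Made-up toy corpus standing in for tokenized interview transcripts.
docs = [
    "school teacher class homework school".split(),
    "hockey game team goal hockey".split(),
    "teacher class team game".split(),
]
theta = fit_lda(docs, n_topics=2)
for d, row in enumerate(theta):
    print(d, [round(p, 2) for p in row])
```

Each row of `theta` is the topic distribution for one text. With real data, the number of topics and the priors `alpha` and `beta` are choices the researcher has to make and justify.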

Extracting and analyzing topic could benefit sociolinguistic research in several ways. It would allow researchers to incorporate a measure of topic into regression models as a gradient variable. Speaker attitudes could also be extracted from open-ended questions, rather than from questions that require a closed answer (e.g. Ethnic Orientation in Hoffman & Walker, 2010), using the same dimensionality-reduction analyses. We present different applications of this technique across a variety of language data and for different languages.
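To make the idea of topic as a gradient predictor concrete, here is a minimal sketch of a logistic regression in which the probability of one topic in a text predicts a binary linguistic variant. The helper `fit_logistic`, the learning rate, and the data are hypothetical and invented purely for illustration. They are not results from the workshop, and a real analysis would more likely use a mixed-effects model in R.

```python
import math


def fit_logistic(x, y, lr=0.5, n_iter=5000):
    """Fit P(y = 1) = sigmoid(b0 + b1 * x) by gradient ascent on the log-likelihood."""
    b0, b1 = 0.0, 0.0
    n = len(x)
    for _ in range(n_iter):
        g0 = g1 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            g0 += yi - p          # gradient with respect to the intercept
            g1 += (yi - p) * xi   # gradient with respect to the slope
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1


def predict(b0, b1, x):
    """Predicted probability of the variant coded as 1."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))


# Made-up illustration data: the probability of one topic in each text,
# and whether a token from that text shows the variant coded as 1.
topic_prob = [0.05, 0.10, 0.20, 0.35, 0.50, 0.65, 0.80, 0.90]
variant = [0, 0, 0, 1, 0, 1, 1, 1]

b0, b1 = fit_logistic(topic_prob, variant)
print("intercept:", round(b0, 2), "slope:", round(b1, 2))
print("P(variant) at topic_prob = 0.1:", round(predict(b0, b1, 0.1), 2))
print("P(variant) at topic_prob = 0.9:", round(predict(b0, b1, 0.9), 2))
```

Because the topic probability enters the model as a continuous value, no text has to be assigned to a single discrete topic category before the regression is run.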

The outline of the workshop is below:

  1. Introduction
  2. Preparing the data
  3. Topic modeling
  4. Using topic probabilities in a regression
  5. Application 1: Lexical borrowing in French print media (Gyula Zsombok)
  6. Application 2: Linguistic landscapes on social media (Kate Lyons)
  7. Conclusion & Discussion

The R code used to generate the results in the workshop will be made available to registrants, but the workshop is not hands-on, so you do not need to install anything beforehand or bring a laptop. It is a discussion of the methodology, the choices researchers will face when using these techniques, and applications.

Supplementary Material

  1. Notes: including Regression with Topic Models and Sentiment Analysis
  2. Lexical Borrowing Supplementary Code: R code, Python code
  3. Linguistic Landscapes Supplementary Code: R code
  4. Regular Expressions in R and generally