Overview

This workshop will illustrate how to use text mining, and more specifically topic modeling (Blei, 2012; Grün and Hornik, 2011), on a transcribed corpus in order to model linguistic variation and change, and will demonstrate the results of such analysis on two different sets of data. The process involves taking a corpus of texts and reducing every text, e.g. a sociolinguistic interview, to a set of topics with an associated probability for each topic.
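To make the idea of reducing a text to topic probabilities concrete, here is a minimal sketch in Python. It is not the workshop's own code (which is in R). It assumes latent Dirichlet allocation fitted with a toy collapsed Gibbs sampler, and the four miniature "interviews", the function name fit_lda and all parameter values are invented for illustration.

```python
import random
from collections import Counter

def fit_lda(docs, n_topics=2, alpha=0.1, beta=0.01, n_iter=200, seed=0):
    """Toy collapsed Gibbs sampler for LDA; returns per-document topic probabilities."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * n_topics for _ in docs]         # tokens per topic in each document
    topic_word = [Counter() for _ in range(n_topics)]  # word counts per topic
    topic_total = [0] * n_topics                       # total tokens per topic
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z = [rng.randrange(n_topics) for _ in doc]
        assignments.append(z)
        for w, k in zip(doc, z):
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Smoothed topic proportions for each document (each row sums to 1).
    return [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in counts]
        for counts, doc in zip(doc_topic, docs)
    ]

# Invented miniature "interviews", already tokenized and lowercased.
docs = [
    "school teacher class school homework".split(),
    "teacher class exam school".split(),
    "hockey game team hockey goal".split(),
    "team game goal season".split(),
]
theta = fit_lda(docs, n_topics=2)
for row in theta:
    print([round(p, 3) for p in row])
```

Each printed row is one text expressed as a probability for each topic. With a real corpus one would normally use an established implementation and choose the number of topics deliberately; the sampler above only shows the shape of the output.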

The extraction and analysis of topic could have several benefits for sociolinguistic research: it would allow researchers to incorporate a measure of topic into regression models as a gradient variable. Speaker attitudes could also be extracted from open-ended questions, rather than from questions that require a closed answer (e.g. Ethnic Orientation in Hoffman & Walker, 2010), using the same multidimensional reduction analyses. We present different applications of this technique across a variety of language data and for different languages.
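As a rough illustration of topic as a gradient predictor, the sketch below fits a one-predictor logistic regression by gradient ascent in Python. It is only one possible setup, not the workshop's analysis: the topic probabilities, the binary variant codes and the names fit_logistic and predict are all hypothetical, and a real study would use a standard statistics package with the usual controls and random effects.

```python
import math

def fit_logistic(x, y, lr=0.5, n_iter=5000):
    """Fit P(y = 1) = sigmoid(b0 + b1 * x) by plain gradient ascent on the log-likelihood."""
    b0, b1 = 0.0, 0.0
    n = len(x)
    for _ in range(n_iter):
        g0 = g1 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            g0 += yi - p           # gradient with respect to the intercept
            g1 += (yi - p) * xi    # gradient with respect to the slope
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1

# Invented data: one row per token of a linguistic variable.
# topic_prob is the probability of some topic in the text the token comes from;
# variant is 1 if the speaker used the variant of interest, 0 otherwise.
topic_prob = [0.05, 0.10, 0.20, 0.35, 0.50, 0.65, 0.80, 0.95]
variant = [0, 0, 1, 0, 0, 1, 1, 1]

b0, b1 = fit_logistic(topic_prob, variant)

def predict(x):
    """Predicted probability of the variant at topic probability x."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

print(round(predict(0.1), 3), round(predict(0.9), 3))
```

Because the predictor is a probability rather than a category label, the fitted slope describes how the odds of the variant change as a text becomes more strongly associated with the topic.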

The outline of the workshop is below:

Introduction
Preparing the data
Topic modeling
Using topic probabilities in a regression
Application 1: Lexical borrowing in French print media (Gyula Zsombok)
Application 2: Linguistic landscapes on social media (Kate Lyons)
Conclusion & Discussion

R code used to generate results in the workshop will be made available to registrants, but the workshop is not hands-on, so you do not need to install anything beforehand or bring a laptop. It is a discussion of the methodology, the choices researchers will face when using these techniques, and applications.

Workshop Materials

Supplementary Material

Notes: including Regression with Topic Models and Sentiment Analysis
Lexical Borrowing Supplementary Code: R code, Python
Linguistic Landscapes Supplementary Code: R code
Regular Expressions in R and generally

Text Mining for Sociolinguistic Research Workshop

Joseph Roy, Anna María Escobar, Kate Lyons & Gyula Zsombok

NWAV 45 2016
