NWAV 45 2016

Introduction

Text Mining

Text Mining/Analysis/Analytics = using unstructured (mostly lexical) data to model some kind of information.

Objectives

  1. If you know and use R, we want you to leave the workshop with the ability to apply what we've talked about to your own data or a collaborator's data, as well as an understanding of the basic methodology.

  2. If you aren't proficient with R, we want you to leave the workshop with an understanding of the methodology behind these techniques and how you might apply them to your own work.

Why now?

  1. Many packages (e.g. tm, RWeka) are available to analyze, part-of-speech tag, and syntactically parse data.
  2. tidytext (released July 2016)
  3. Growing use in other behavioral and social sciences (see references)
  4. Increase skills for alt-ac/post-ac positions

Data

  1. International Corpus of English: http://ice-corpora.net/ice/
  2. Canadian Corpus
  • Private Spoken Section (Phone and face-to-face interviews)

Preparing the Data

Processing

Any text analysis, regardless of language, generally involves some of the following steps:

  1. Read text files into R.
  2. Remove white space, metadata, punctuation, numbers, and non-UTF-8 characters (anything that is not a lexical item to be included in the analysis), and make everything lowercase.
  3. Remove stopwords.
  4. Stem the document.

The order and inclusion of different steps (specifically 3 and 4) can change based on the data and research question of interest.
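A minimal sketch of steps 2-4 with the tm package, on an invented in-memory corpus (the real data is read from files in a later section):

library(tm)

docs <- Corpus(VectorSource(c("The 2 horses were winning!",
                              "Winners win races.")))
docs <- tm_map(docs, stripWhitespace)                    # collapse extra white space
docs <- tm_map(docs, removePunctuation)                  # drop punctuation
docs <- tm_map(docs, removeNumbers)                      # drop digits
docs <- tm_map(docs, content_transformer(tolower))       # lower-case everything
docs <- tm_map(docs, removeWords, stopwords("english"))  # step 3: remove stop words
docs <- tm_map(docs, stemDocument)                       # step 4: stem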

The Text

Tagged Corpus

Pre-processing (Read in Data)

Remove extra white space, metadata, punctuation, numbers, and non-UTF-8 characters (anything that is not a lexical item to be included in the analysis), and make everything lowercase.

Original Data

Removing Stop Words

Stop Words

Uninformative words are removed because they are thought to be unhelpful in determining topic. Linguists would refer to these as function words (versus content words).


Stopwords (English)

##   [1] "i"          "me"         "my"         "myself"     "we"        
##   [6] "our"        "ours"       "ourselves"  "you"        "your"      
##  [11] "yours"      "yourself"   "yourselves" "he"         "him"       
##  [16] "his"        "himself"    "she"        "her"        "hers"      
##  [21] "herself"    "it"         "its"        "itself"     "they"      
##  [26] "them"       "their"      "theirs"     "themselves" "what"      
##  [31] "which"      "who"        "whom"       "this"       "that"      
##  [36] "these"      "those"      "am"         "is"         "are"       
##  [41] "was"        "were"       "be"         "been"       "being"     
##  [46] "have"       "has"        "had"        "having"     "do"        
##  [51] "does"       "did"        "doing"      "would"      "should"    
##  [56] "could"      "ought"      "i'm"        "you're"     "he's"      
##  [61] "she's"      "it's"       "we're"      "they're"    "i've"      
##  [66] "you've"     "we've"      "they've"    "i'd"        "you'd"     
##  [71] "he'd"       "she'd"      "we'd"       "they'd"     "i'll"      
##  [76] "you'll"     "he'll"      "she'll"     "we'll"      "they'll"   
##  [81] "isn't"      "aren't"     "wasn't"     "weren't"    "hasn't"    
##  [86] "haven't"    "hadn't"     "doesn't"    "don't"      "didn't"    
##  [91] "won't"      "wouldn't"   "shan't"     "shouldn't"  "can't"     
##  [96] "cannot"     "couldn't"   "mustn't"    "let's"      "that's"    
## [101] "who's"      "what's"     "here's"     "there's"    "when's"    
## [106] "where's"    "why's"      "how's"      "a"          "an"        
## [111] "the"        "and"        "but"        "if"         "or"        
## [116] "because"    "as"         "until"      "while"      "of"        
## [121] "at"         "by"         "for"        "with"       "about"     
## [126] "against"    "between"    "into"       "through"    "during"    
## [131] "before"     "after"      "above"      "below"      "to"        
## [136] "from"       "up"         "down"       "in"         "out"       
## [141] "on"         "off"        "over"       "under"      "again"     
## [146] "further"    "then"       "once"       "here"       "there"     
## [151] "when"       "where"      "why"        "how"        "all"       
## [156] "any"        "both"       "each"       "few"        "more"      
## [161] "most"       "other"      "some"       "such"       "no"        
## [166] "nor"        "not"        "only"       "own"        "same"      
## [171] "so"         "than"       "too"        "very"

Stopwords (French)

##   [1] "au"       "aux"      "avec"     "ce"       "ces"      "dans"    
##   [7] "de"       "des"      "du"       "elle"     "en"       "et"      
##  [13] "eux"      "il"       "je"       "la"       "le"       "leur"    
##  [19] "lui"      "ma"       "mais"     "me"       "même"     "mes"     
##  [25] "moi"      "mon"      "ne"       "nos"      "notre"    "nous"    
##  [31] "on"       "ou"       "par"      "pas"      "pour"     "qu"      
##  [37] "que"      "qui"      "sa"       "se"       "ses"      "son"     
##  [43] "sur"      "ta"       "te"       "tes"      "toi"      "ton"     
##  [49] "tu"       "un"       "une"      "vos"      "votre"    "vous"    
##  [55] "c"        "d"        "j"        "l"        "à"        "m"       
##  [61] "n"        "s"        "t"        "y"        "été"      "étée"    
##  [67] "étées"    "étés"     "étant"    "suis"     "es"       "est"     
##  [73] "sommes"   "êtes"     "sont"     "serai"    "seras"    "sera"    
##  [79] "serons"   "serez"    "seront"   "serais"   "serait"   "serions" 
##  [85] "seriez"   "seraient" "étais"    "était"    "étions"   "étiez"   
##  [91] "étaient"  "fus"      "fut"      "fûmes"    "fûtes"    "furent"  
##  [97] "sois"     "soit"     "soyons"   "soyez"    "soient"   "fusse"   
## [103] "fusses"   "fût"      "fussions" "fussiez"  "fussent"  "ayant"   
## [109] "eu"       "eue"      "eues"     "eus"      "ai"       "as"      
## [115] "avons"    "avez"     "ont"      "aurai"    "auras"    "aura"    
## [121] "aurons"   "aurez"    "auront"   "aurais"   "aurait"   "aurions" 
## [127] "auriez"   "auraient" "avais"    "avait"    "avions"   "aviez"   
## [133] "avaient"  "eut"      "eûmes"    "eûtes"    "eurent"   "aie"     
## [139] "aies"     "ait"      "ayons"    "ayez"     "aient"    "eusse"   
## [145] "eusses"   "eût"      "eussions" "eussiez"  "eussent"  "ceci"    
## [151] "cela"     "celà"     "cet"      "cette"    "ici"      "ils"     
## [157] "les"      "leurs"    "quel"     "quels"    "quelle"   "quelles" 
## [163] "sans"     "soi"

Stopwords (Supported)

Currently supported languages: Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Portuguese, Russian, Spanish, Swedish, Catalan, Romanian, and the extended "smart-english" list.
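A quick way to inspect these built-in lists is tm's stopwords() function (the kind names here follow tm's documentation):

library(tm)

stopwords("english")  # the English list shown above
stopwords("french")   # the French list shown above
stopwords("SMART")    # the longer smart-english list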

Stemming

Stemming removes uninformative morphological endings (e.g. horse and horses both become hors; winning, winner, and win all become win). The following languages are supported:

##  [1] "danish"     "dutch"      "english"    "finnish"    "french"    
##  [6] "german"     "hungarian"  "italian"    "norwegian"  "porter"    
## [11] "portuguese" "romanian"   "russian"    "spanish"    "swedish"   
## [16] "turkish"
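These are the Snowball stemmer languages. A minimal sketch of stemming individual words directly, assuming the SnowballC package (which tm's stemDocument relies on) is installed:

library(SnowballC)

# Stem a few inflected forms; compare with the examples above
wordStem(c("horse", "horses", "winning", "winner", "win"),
         language = "english")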

Stopwords

Stemming Words

Text Mining

Pre-processing > Sentiment Analysis

Pre-processing > Stop words > Stemming > Topic Models (LDA)

Pre-processing > Stop words > Stemming > Other Techniques

What does this look like in R?

library(tm)
library(tidytext)
library(topicmodels)
library(dplyr)
library(stringr)
library(ggplot2)
library(mgcv)
library(visreg)

ICE_CE = Corpus(DirSource(directory = "C:/Users/Joe/Desktop/ICE Canadian Corpus",
                          pattern = "\\.(txt)$"))

id = names(ICE_CE)

What does this look like in R?

# Strip digits, lower-case, and restore the plain-text document class
ICE_CE <- tm_map(ICE_CE, removeNumbers)
ICE_CE <- tm_map(ICE_CE, tolower)
ICE_CE <- tm_map(ICE_CE, PlainTextDocument)

tidyCanadian = tidy(ICE_CE)

# The file name gets lost in tm_map, so restore it
tidyCanadian$id = id

What does this look like in R?

spokenCan = tidyCanadian %>% 
  filter(grepl("S1A|S1a", id)) %>%               # keep the private spoken files
  unnest_tokens(line, text, token = "lines") %>% 
  group_by(id) %>% 
  mutate(linenumber = row_number()) %>% 
  ungroup()

What does this look like in R?

spokenCan = spokenCan %>%
  mutate(line = gsub("<.*?>", "", line)) %>%  # strip corpus markup tags
  mutate(line = gsub(" '", "'", line)) %>%    # reattach detached clitics
  mutate(id = gsub("S1A-", "", id)) %>%
  mutate(id = gsub("S1a-", "", id)) %>%
  mutate(id = gsub(".txt", "", id),
         uniqid = paste(id, "_", linenumber, sep = ""))

Topic Modeling

From Blei (2012: 77): topic models are algorithms for discovering the main themes that pervade a large and otherwise unstructured collection of documents. Here "large" means more than 100 documents or texts. The number of topics, k, should be much less than the number of documents in your corpus: 100 documents \(\gg\) k = 20 topics.

Intuition for Topic Modeling

Topic Modeling Process

Each topic has a set of probabilities associated with each lexical item; the same lexical item can be associated with multiple topics. These probabilities can be used to interpret what the topic might mean.

Each document can be classified into a topic by summing all the lexical weights associated with that topic in the document and taking the highest sum as the classification.
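A toy illustration of that sum in base R (the weights and counts below are invented for exposition):

# Hypothetical per-topic lexical weights (rows = words, columns = topics)
weights <- matrix(c(0.7, 0.1,    # "hockey"
                    0.2, 0.3,    # "school"
                    0.1, 0.6),   # "exam"
                  nrow = 3, byrow = TRUE,
                  dimnames = list(c("hockey", "school", "exam"),
                                  c("topic1", "topic2")))

# One document's word counts
counts <- c(hockey = 5, school = 1, exam = 0)

scores <- colSums(weights * counts)  # sum the lexical weights per topic
names(which.max(scores))             # "topic1": this document's classification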

Topics

  • Assumes a collection of texts that can be reduced to a set of topics.
  • Topic is sometimes equivalent to genre or to sociolinguistic topic (e.g. school, work, etc.).
  • The interpretation of topic is local to the research question and the data.
  • Proxy for Social Attitudes? Ethnic Orientation (Hoffman and Walker, 2010)? Language Attitudes?

DTM

A Document Term Matrix (DTM) is a matrix where rows are your unique documents (e.g. each interview), columns are all the unique lexical items in your corpus, and each cell is a count of how often that item occurs in that document.
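A minimal toy DTM built with tm (the two sentences are invented), just to show the structure:

library(tm)

toy <- Corpus(VectorSource(c("the cat sat", "the dog sat")))
toy_dtm <- DocumentTermMatrix(toy)
inspect(toy_dtm)  # rows = documents, columns = terms, cells = counts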


DTM (Visual)

# Word count by unique lexical item
by_line_word <- spokenCan %>% 
  unnest_tokens(word, line)

word_counts = by_line_word %>% 
  count(id, word, sort = TRUE) %>% 
  ungroup()

can_dtm = word_counts %>% 
  cast_dtm(id, word, n)

Latent Dirichlet Allocation (LDA)

Essentially, this process reduces your DTM to a set of k topics. Each lexical item has a probability of belonging to Topic\(_i\), where i runs from 1 to k. Interpreting what each topic may mean is up to the analyst and requires examining the top terms for each topic (as well as which documents get classified into each topic).

Latent Semantic Analysis, in linguistics and psychology, did much the same thing as topic modeling, but used Principal Component Analysis on produced speech.

https://cran.r-project.org/web/packages/ldatuning/vignettes/topics.html

The ldatuning package can help determine the optimal number of topics. For a large number of candidate values of k, this requires a fast computer or leaving your computer running for a week.
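A sketch of that search with ldatuning (the range of k values is illustrative; the metric names are ldatuning's built-ins):

library(ldatuning)

result <- FindTopicsNumber(
  can_dtm,
  topics  = seq(5, 50, by = 5),
  metrics = c("Griffiths2004", "CaoJuan2009", "Arun2010", "Deveaud2014"),
  method  = "Gibbs",
  control = list(seed = 1234)
)
FindTopicsNumber_plot(result)  # compare the metrics across candidate k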

Performing LDA in R

canTopics <- LDA(can_dtm, k = 5, control = list(seed = 1234))

# Per-topic per-word probabilities (beta)
tidyCanTopics <- tidytext:::tidy.LDA(canTopics)

See notes for tuning parameter/samplers.

Top Terms for Each Topic

top_terms <- tidyCanTopics %>% 
  group_by(topic) %>% 
  top_n(10, beta) %>% 
  ungroup() %>% 
  arrange(topic, -beta)

top_terms %>% 
  mutate(term = reorder(term, beta)) %>% 
  ggplot(aes(term, beta, fill = factor(topic))) + 
  geom_bar(alpha = 0.8, stat = "identity", show.legend = FALSE) + 
  facet_wrap(~topic, scales = "free")

Top Terms for Topic 1

[Figure: bar plot of the top terms for Topic 1]

Top Terms for Topic 8

[Figure: bar plot of the top terms for Topic 8]

Researcher Degrees of Freedom

  1. Document decisions from time of data collection.
  2. Explore methods on subset of data – run confirmatory analysis on full data.
  3. Publish a research journal documenting decisions (e.g. in Rmarkdown).
  4. Perform a multiverse analysis (Steegen, Tuerlinckx, Gelman and Vanpaemel, 2016).

Using in Regression

Topic modeling in LVC

We can easily include the topic probability for each interview (or smaller component) in a regression analysis, either via glm() or, more realistically, gam(): Generalized Additive Models work by fitting smooth functions to continuous predictors, and here we do not expect topic probability to increase or decrease linearly with respect to the outcome.
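A sketch of such a model with mgcv; the data frame tokens and its columns are hypothetical, and the smooth terms mirror the gam1 summary shown below:

library(mgcv)

# Hypothetical token-level data: outcome `type` plus per-text topic
# probabilities topict1 ... topict6
gam1 <- gam(type == "inversion" ~ s(topict1) + s(topict2) + s(topict3) +
              s(topict5) + s(topict6),
            family = binomial, data = tokens, method = "REML")
summary(gam1)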

Do Support and Topic

Do support: I did not know John.

Raising: I knew not John.

Source: Penn-Helsinki Parsed Corpus of Early Modern English.

Do Support and Topic

load("gam1.RData")
summary(gam1)$formula
## type == "inversion" ~ s(topict1) + s(topict2) + s(topict3) + 
##     s(topict3) + s(topict5) + s(topict6)
summary(gam1)$s.table
##                 edf   Ref.df    Chi.sq      p-value
## s(topict1) 8.512208 8.883258 75.167438 7.927550e-13
## s(topict2) 1.000006 1.000011 18.330619 1.857165e-05
## s(topict3) 1.276552 1.483578  2.798771 1.842087e-01
## s(topict5) 4.115305 5.019976  7.574573 1.868884e-01
## s(topict6) 8.089833 8.749664 16.550181 4.242332e-02

Coverage

## [1] 0.2449593

Applications

Focus is on Methodology

  1. Lexical Borrowing.
  2. Linguistic Landscapes with Twitter.

Conclusions

Summary

  1. Topics (or their sociolinguistic equivalents) can be included in models of LVC.
  2. The machinery underlying topic models can be used to analyze the unstructured data that sociolinguists have available in their corpora of spoken language.
  3. Reliability, replicability, and reproducibility with these methods can be achieved by documenting decisions in a transparent and reproducible manner throughout the topic-fitting process.

Acknowledgements

  • University of Illinois at Urbana-Champaign
    • Office of the Vice-Chancellor for Research
    • School of Literatures, Cultures and Linguistics
    • Graduate College
  • Participants in the Text Mining Workshop at SatRDays in Urbana, Illinois.