NWAV 45 2016
Text Mining/Analysis/Analytics = using unstructured (mostly lexical) data to model some kind of information.
If you know and use R, we want you to leave the workshop with the ability to apply what we've talked about to your own data or a collaborator's data, as well as an understanding of the basic methodology.
If you aren't proficient with R, we want you to leave the workshop with an understanding of the methodology behind these techniques and how you might be able to apply them to your own work.
Any text analysis, regardless of language, generally involves some of the following steps:
The order and inclusion of different steps (specifically 3 and 4) can change based on the data and research question of interest.
Remove extra white space, metadata, punctuation, numbers, and non-UTF-8 characters [anything that is not a lexical item to be included in the analysis], and make everything lower case.
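As a minimal sketch with the tm package (assuming a hypothetical Corpus object called corpus, not one of the objects built later in the workshop), these cleaning steps might look like:

library(tm)

corpus <- tm_map(corpus, stripWhitespace)               # collapse extra white space
corpus <- tm_map(corpus, removePunctuation)             # drop punctuation
corpus <- tm_map(corpus, removeNumbers)                 # drop numbers
corpus <- tm_map(corpus, content_transformer(tolower))  # lower-case everything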
Uninformative words removed: these are thought to be unhelpful in determining topic. Linguists would refer to these as function words (versus content words).
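The tm package ships ready-made stop word lists; a sketch of inspecting and removing them (again assuming a hypothetical Corpus called corpus). The English and French lists are printed below.

library(tm)

stopwords("english")   # inspect the English list
stopwords("french")    # inspect the French list

corpus <- tm_map(corpus, removeWords, stopwords("english"))  # drop them from the corpus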
## [1] "i" "me" "my" "myself" "we" ## [6] "our" "ours" "ourselves" "you" "your" ## [11] "yours" "yourself" "yourselves" "he" "him" ## [16] "his" "himself" "she" "her" "hers" ## [21] "herself" "it" "its" "itself" "they" ## [26] "them" "their" "theirs" "themselves" "what" ## [31] "which" "who" "whom" "this" "that" ## [36] "these" "those" "am" "is" "are" ## [41] "was" "were" "be" "been" "being" ## [46] "have" "has" "had" "having" "do" ## [51] "does" "did" "doing" "would" "should" ## [56] "could" "ought" "i'm" "you're" "he's" ## [61] "she's" "it's" "we're" "they're" "i've" ## [66] "you've" "we've" "they've" "i'd" "you'd" ## [71] "he'd" "she'd" "we'd" "they'd" "i'll" ## [76] "you'll" "he'll" "she'll" "we'll" "they'll" ## [81] "isn't" "aren't" "wasn't" "weren't" "hasn't" ## [86] "haven't" "hadn't" "doesn't" "don't" "didn't" ## [91] "won't" "wouldn't" "shan't" "shouldn't" "can't" ## [96] "cannot" "couldn't" "mustn't" "let's" "that's" ## [101] "who's" "what's" "here's" "there's" "when's" ## [106] "where's" "why's" "how's" "a" "an" ## [111] "the" "and" "but" "if" "or" ## [116] "because" "as" "until" "while" "of" ## [121] "at" "by" "for" "with" "about" ## [126] "against" "between" "into" "through" "during" ## [131] "before" "after" "above" "below" "to" ## [136] "from" "up" "down" "in" "out" ## [141] "on" "off" "over" "under" "again" ## [146] "further" "then" "once" "here" "there" ## [151] "when" "where" "why" "how" "all" ## [156] "any" "both" "each" "few" "more" ## [161] "most" "other" "some" "such" "no" ## [166] "nor" "not" "only" "own" "same" ## [171] "so" "than" "too" "very"
## [1] "au" "aux" "avec" "ce" "ces" "dans" ## [7] "de" "des" "du" "elle" "en" "et" ## [13] "eux" "il" "je" "la" "le" "leur" ## [19] "lui" "ma" "mais" "me" "même" "mes" ## [25] "moi" "mon" "ne" "nos" "notre" "nous" ## [31] "on" "ou" "par" "pas" "pour" "qu" ## [37] "que" "qui" "sa" "se" "ses" "son" ## [43] "sur" "ta" "te" "tes" "toi" "ton" ## [49] "tu" "un" "une" "vos" "votre" "vous" ## [55] "c" "d" "j" "l" "à " "m" ## [61] "n" "s" "t" "y" "été" "étée" ## [67] "étées" "étés" "étant" "suis" "es" "est" ## [73] "sommes" "êtes" "sont" "serai" "seras" "sera" ## [79] "serons" "serez" "seront" "serais" "serait" "serions" ## [85] "seriez" "seraient" "étais" "était" "étions" "étiez" ## [91] "étaient" "fus" "fut" "fûmes" "fûtes" "furent" ## [97] "sois" "soit" "soyons" "soyez" "soient" "fusse" ## [103] "fusses" "fût" "fussions" "fussiez" "fussent" "ayant" ## [109] "eu" "eue" "eues" "eus" "ai" "as" ## [115] "avons" "avez" "ont" "aurai" "auras" "aura" ## [121] "aurons" "aurez" "auront" "aurais" "aurait" "aurions" ## [127] "auriez" "auraient" "avais" "avait" "avions" "aviez" ## [133] "avaient" "eut" "eûmes" "eûtes" "eurent" "aie" ## [139] "aies" "ait" "ayons" "ayez" "aient" "eusse" ## [145] "eusses" "eût" "eussions" "eussiez" "eussent" "ceci" ## [151] "cela" "celà " "cet" "cette" "ici" "ils" ## [157] "les" "leurs" "quel" "quels" "quelle" "quelles" ## [163] "sans" "soi"
Currently supported languages: Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Portuguese, Russian, Spanish, Swedish, Catalan, Romanian, and SMART English.
Removes uninformative morphological endings (e.g. horse and horses both become hors; winning, winner, and win become win). The following languages are supported:
## [1] "danish" "dutch" "english" "finnish" "french" ## [6] "german" "hungarian" "italian" "norwegian" "porter" ## [11] "portuguese" "romanian" "russian" "spanish" "swedish" ## [16] "turkish"
Pre-processing > Sentiment Analysis
Pre-processing > Stop words > Stemming > Topic Models (LDA)
Pre-processing > Stop words > Stemming > Other Techniques
library(tm)
library(tidytext)
library(topicmodels)
library(dplyr)
library(stringr)
library(ggplot2)
library(mgcv)
library(visreg)

# Read the corpus: every .txt file in the ICE-Canada directory
ICE_CE = Corpus(DirSource(directory = "C:/Users/Joe/Desktop/ICE Canadian Corpus", pattern = "\\.(txt)$"))
id = names(ICE_CE)
# Basic cleaning: strip numbers and lower-case everything
ICE_CE <- tm_map(ICE_CE, removeNumbers)
ICE_CE <- tm_map(ICE_CE, tolower)
ICE_CE <- tm_map(ICE_CE, PlainTextDocument)

# Convert to a tidy data frame; the file name gets lost, so restore it
tidyCanadian = tidy(ICE_CE)
tidyCanadian$id = id
# Keep the spoken dialogue files (S1A/S1a) and split each text into lines
spokenCan = tidyCanadian %>%
  filter(grepl("S1A|S1a", id)) %>%
  unnest_tokens(line, text, token = "lines") %>%
  group_by(id) %>%
  mutate(linenumber = row_number()) %>%
  ungroup()
# Strip markup tags, tidy apostrophes, and build a unique line id
spokenCan = spokenCan %>%
  mutate(line = gsub("<.*?>", "", line)) %>%
  mutate(line = gsub(" '", "'", line)) %>%
  mutate(id = gsub("S1A-", "", id)) %>%
  mutate(id = gsub("S1a-", "", id)) %>%
  mutate(id = gsub(".txt", "", id),
         uniqid = paste(id, "_", linenumber, sep = ""))
From Blei (2012: 77): "Topic models are algorithms for discovering the main themes that pervade a large and otherwise unstructured collection of documents." Here "large" means more than roughly 100 documents or texts. The number of topics, k, should be much smaller than the number of documents in your corpus, e.g. 100 documents \(\gg\) k = 20 topics.
Each topic has a set of probabilities, one for each lexical item; the same lexical item can be associated with multiple topics. These probabilities can be used to interpret what the topic might mean.
Each document can be classified into a topic by summing, for each topic, the weights of the lexical items the document contains and taking the topic with the highest sum as the classification.
A Document-Term Matrix (DTM) is a matrix where rows are your unique documents (e.g. each interview), columns are all the unique lexical items in your corpus, and each cell is a count.
# Word count by unique lexical item
by_line_word <- spokenCan %>%
  unnest_tokens(word, line)

word_counts = by_line_word %>%
  count(id, word, sort = TRUE) %>%
  ungroup()

can_dtm = word_counts %>%
  cast_dtm(id, word, n)
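A quick sanity check of the resulting DTM (documents in rows, terms in columns):

can_dtm                      # prints dimensions and sparsity
inspect(can_dtm[1:5, 1:5])   # peek at a small corner of the matrix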
Essentially, this process reduces your DTM to a set of k topics. Each lexical item has a probability of belonging to Topic\(_i\), where i goes from 1 to k. Interpreting what each topic may mean is up to the analyst and requires examining the top terms for each topic (as well as which documents get classified into each topic).
Latent Semantic Analysis, in linguistics and psychology, did much the same thing as topic modeling, but used Principal Component Analysis on produced speech.
https://cran.r-project.org/web/packages/ldatuning/vignettes/topics.html
The ldatuning package can help determine the optimal number of topics. For a large number of candidate values of k, this requires a fast computer or leaving your computer running for a week.
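Following the ldatuning vignette linked above, a search over candidate values of k for our DTM might look like this (the range of topics and the seed are illustrative):

library(ldatuning)

result <- FindTopicsNumber(
  can_dtm,
  topics = seq(2, 20, by = 1),
  metrics = c("Griffiths2004", "CaoJuan2009", "Arun2010", "Deveaud2014"),
  method = "Gibbs",
  control = list(seed = 1234),
  mc.cores = 2L
)

FindTopicsNumber_plot(result)   # compare the metrics across values of k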
canTopics <- LDA(can_dtm, k = 5, control = list(seed = 1234))

# Per-topic word probabilities (the "beta" matrix)
tidyCanTopics <- tidy(canTopics, matrix = "beta")
See the notes for tuning parameters/samplers.
top_terms <- tidyCanTopics %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)

top_terms %>%
  mutate(term = reorder(term, beta)) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_bar(alpha = 0.8, stat = "identity", show.legend = FALSE) +
  facet_wrap(~topic, scales = "free")
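The per-document topic probabilities (the "gamma" matrix) give the classification described earlier and are what we carry into the regression below; a sketch:

# Per-document topic probabilities
doc_topics <- tidy(canTopics, matrix = "gamma")

# Classify each interview by its most probable topic
doc_class <- doc_topics %>%
  group_by(document) %>%
  top_n(1, gamma) %>%
  ungroup()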
We can easily include the topic probability for each interview (or smaller component) in a regression analysis, either via glm() or, more realistically, gam(). Generalized Additive Models work by fitting smooth functions to continuous predictors; here we do not expect the response to increase or decrease linearly with respect to topic probability. A sketch of how such a model might be fit follows the example below.
Do support: I did not know John.
Raising: I knew not John.
Source: Penn-Parsed Corpus of Early Modern English.
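The model loaded below was fit in advance; as a sketch, fitting something like it with mgcv might look as follows (the data frame pennEME and its columns are hypothetical stand-ins for the Penn corpus tokens with their per-text topic probabilities):

library(mgcv)
library(visreg)

# Binomial GAM: probability of "inversion" (vs. do-support) as smooth
# functions of the topic probabilities
gam1 <- gam(type == "inversion" ~ s(topict1) + s(topict2) + s(topict3) +
              s(topict5) + s(topict6),
            family = binomial, data = pennEME, method = "REML")

# Visualize one smooth on the probability scale
visreg(gam1, "topict1", scale = "response")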
load("gam1.RData") summary(gam1)$formula
## type == "inversion" ~ s(topict1) + s(topict2) + s(topict3) + ## s(topict3) + s(topict5) + s(topict6)
summary(gam1)$s.table
##                 edf   Ref.df    Chi.sq      p-value
## s(topict1) 8.512208 8.883258 75.167438 7.927550e-13
## s(topict2) 1.000006 1.000011 18.330619 1.857165e-05
## s(topict3) 1.276552 1.483578  2.798771 1.842087e-01
## s(topict5) 4.115305 5.019976  7.574573 1.868884e-01
## s(topict6) 8.089833 8.749664 16.550181 4.242332e-02
## [1] 0.2449593