Text Mining for Linguistic Landscape Research

November 3, 2016

Linguistic Landscapes

Linguistic Landscapes is a subfield of sociolinguistics that studies "the visibility and salience of languages on public and commercial signs in a given territory or region" (Landry and Bourhis 1997).

Linguistic Landscapes on Social Media

Getting the data

With Twitter's search API you can search for tweets that have been 'geotagged': tweets that have longitude and latitude coordinates attached to them.

My research is based in the Mission District neighborhood in San Francisco, so I am using coordinates that will get me tweets from that area. I've chosen a radius of 1 km from this central coordinate but you can expand this radius.
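The geocode argument passed to searchTwitter() in the next step is just a string of the form latitude,longitude,radius. Here is a minimal sketch of building it from its parts so that the radius is easy to change. The variable names are my own for illustration and are not part of twitteR.

```r
# Centre of the search area and radius (values from the text above)
centre_lat <- 37.76
centre_lon <- -122.42
radius_km  <- 1

# Build the "lat,lon,radius" string expected by the geocode argument
geocode <- sprintf("%s,%s,%skm", centre_lat, centre_lon, radius_km)
print(geocode)
```

To widen the search you would only change radius_km and rebuild the string.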

library(twitteR)  #assumes you have already run setup_twitter_oauth()
geo <- searchTwitter('', n = 100000, geocode = '37.76,-122.42,1km',
                     retryOnRateLimit = 1)

## Warning in doRppAPICall("search/tweets", n, params = params,
## retryOnRateLimit = retryOnRateLimit, : 100000 tweets were requested but the
## API can only return 2028

This could take some time, depending on the speed of your Internet connection and how many tweets are available. I was optimistic in requesting 100,000 tweets, so I receive a warning that says how many were actually returned (2,028 here). You can raise this number if you are lucky enough to have more than 100,000 tweets available.

Processing the data

Now you have a list of tweets. A list of tweet objects is awkward to analyze directly in R, so convert it into a data frame:

geoDF<-twListToDF(geo)

Chances are there will be emojis in your Twitter data. You can 'transform' these emojis into prose using the code below, together with a CSV file I've put together that records how each emoji appears in R. (The idea for this comes from Jessica Peterka-Bonetta's work. She has a list of emojis as well, but it does not include the newest batch of emojis or the different skin color options for human-based emojis.) If you use this emoji list for your own research, please make sure to acknowledge both me and Jessica.

Load in the CSV file. You want to make sure it is located in the correct working directory so R can find it when you tell it to read it in.

emoticons <- read.csv("Decoded Emojis Col Sep.csv", header = T)

To transform the emojis, you first need to transform the tweet data into ASCII:

geoDF$text <- iconv(geoDF$text, from = "latin1", to = "ascii", 
                    sub = "byte")
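To see what this conversion does, here is a minimal sketch on a single made-up post. With sub = "byte", each byte that cannot be written in ASCII is replaced by a hex code in angle brackets, so one emoji should come out as a run of codes such as <f0><9f><98><8d>. The post text is invented for illustration.

```r
# A short post containing one emoji (U+1F60D), written with an escape
# so the example does not depend on this file's encoding
post <- "Love this mural \U0001F60D"

# Same call as above: every non-ASCII byte becomes a <xx> hex code
ascii_post <- iconv(post, from = "latin1", to = "ascii", sub = "byte")
print(ascii_post)
```

These bracketed codes are what the emoji lookup file is matched against in the next step.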

To 'count' the emojis, you do a find and replace using the CSV file of 'Decoded Emojis' as a reference. Here I am using the DataCombine package. This identifies emojis in the posts and replaces them with a prose version. For the prose I used whatever description pops up when hovering one's cursor over an emoji on an Apple emoji keyboard. Even where the name differs from other platforms, it provides enough information to find the emoji in question if you are not sure which one was used in the post.

library(DataCombine)
emojireplace <- FindReplace(data = geoDF, Var = "text",
                            replaceData = emoticons,
                            from = "R_Encoding", to = "Name",
                            exact = FALSE)
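If you want to see the find-and-replace idea without the DataCombine package, here is a minimal base R sketch. The two-row lookup table and the posts are made up for illustration (the real CSV has many more rows), and I am assuming the lookup stores each emoji in the same ASCII byte form that iconv() produces.

```r
# Helper: the same ASCII conversion used on the tweet text
to_ascii <- function(x) iconv(x, from = "latin1", to = "ascii", sub = "byte")

# Hypothetical two-row lookup table standing in for the emoji CSV
lookup <- data.frame(
  R_Encoding = to_ascii(c("\U0001F60D", "\u2728")),
  Name = c("SMILINGFACEWITHHEARTSHAPEDEYES ", "SPARKLES "),
  stringsAsFactors = FALSE
)

# Made-up posts, converted the same way as the real data
posts <- to_ascii(c("Tartine \U0001F60D", "Clarion Alley \u2728\u2728"))

# Replace each encoded emoji with its name, one lookup row at a time
for (i in seq_len(nrow(lookup))) {
  posts <- gsub(lookup$R_Encoding[i], lookup$Name[i], posts, fixed = TRUE)
}
print(posts)
```

Because every emoji becomes a word, counting emojis afterwards is just counting words.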

Now might be a good time to save this file. I save it in CSV format, with the collection date in the file name.

#paste0() avoids the spaces that paste() would put into the file name
write.csv(emojireplace, file=paste0("ALL_", Sys.Date(), ".csv"))

Now you have a data frame which you can manipulate in various ways. For my research, I'm just interested in posts that have occurred on Instagram. (Why not just access them via Instagram's API, you ask? Long story short: they are very conservative about providing access for academic research.) I've found a workaround, which is to filter the mined tweets down to those that have Instagram as a source:

data <- emojireplace[emojireplace$statusSource == 
        "<a href=\"http://instagram.com\" rel=\"nofollow\">Instagram</a>", ]

#Save this file
write.csv(data, file=paste0("INSTA_", Sys.Date(), ".csv"))
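The filter above depends on the source string matching exactly. As a hedged alternative, here is a small sketch that matches on the visible client name instead, which would still work if the link inside the string changed. The toy data frame is made up and only stands in for the mined tweets.

```r
# Toy data frame standing in for the mined tweets (made-up rows)
tweets <- data.frame(
  text = c("Brunch!", "Street art on 24th", "Stuck on BART"),
  statusSource = c(
    "<a href=\"http://instagram.com\" rel=\"nofollow\">Instagram</a>",
    "<a href=\"http://instagram.com\" rel=\"nofollow\">Instagram</a>",
    "<a href=\"http://twitter.com\" rel=\"nofollow\">Twitter Web Client</a>"
  ),
  stringsAsFactors = FALSE
)

# Match on the visible client name instead of the full HTML string
is_insta <- grepl(">Instagram</a>", tweets$statusSource, fixed = TRUE)
insta <- tweets[is_insta, ]
print(nrow(insta))
```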

A note about the data

Important: Data collected this way are not representative of all Instagram posts made in the Mission District, because we depend on people who cross-post to Twitter, who are most likely a minority of Mission District Instagrammers. This is an important point about any data obtained via social media: it is never truly representative. Individuals are selective about what they post, since posting is an inherently subjective process, and not everyone is active on social media.

Visualizing the data

Now let's play around with visualizing the data. I want to superimpose different aspects of the tweets I collected on a map. First I have to get a map, which I do using the ggmap package, which interacts with the Google Maps API. When you use this package, be sure to cite it, as it asks you to when you first load it into your library.

I request a map of the Mission District, and then check to make sure the map is what I want (in terms of zoom, area covered, etc.)

library(ggmap)
#Keep the location on one line so no line break ends up inside the string
map <- get_map(location = 'Capp St. and 20th, San Francisco, California',
               zoom = 15)

ggmap(map)

Looks good to me! Now let's start to visualize our Instagram-via-Twitter data. We can start by seeing where our posts are on a map.

#Tell R what we want to map
data$longitude<-as.numeric(data$longitude)
data$latitude<-as.numeric(data$latitude)
lon <- data$longitude
lat <- data$latitude

For now I just want to look at latitude and longitude, but it is possible to map other aspects as well - it just depends on what you'd like to look at.

Now we use ggplot to plot our Instagram data over our map:

mapPoints <- ggmap(map) + geom_point(aes(x = lon, y = lat), 
                                     data=data, alpha=0.5)

mapPoints

We can also look at WHEN the posts were generated. We can make a graph of post frequency over time. Graphs constructed with help from several online sources.

#Create a data frame with number of tweets per time
d <- as.data.frame(table(data$created))
d <- d[order(d$Freq, decreasing=T), ]
names(d) <- c("created","freq")
#Combine this with existing data frame
newdata1 <- merge(data,d,by="created")
#Tell R that 'created' is not an integer or factor but a time.
data$created <- as.POSIXct(data$created)

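Now plot the number of tweets over time in 60-minute bins. For times stored as POSIXct, binwidth is given in seconds, so 60*minutes with minutes set to 60 is one hour; set minutes to 20 for 20-minute intervals.

minutes <- 60
#geom_freqpoly() counts the posts in each bin itself
plot1 <- ggplot(data, aes(created)) + geom_freqpoly(binwidth=60*minutes)
plot1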
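To make the binning concrete, here is a minimal base R sketch of the counting that the frequency polygon does for us. The timestamps are made up and only stand in for data$created.

```r
# Made-up timestamps standing in for data$created
created <- as.POSIXct(c("2016-10-24 09:05:00", "2016-10-24 09:40:00",
                        "2016-10-24 10:15:00", "2016-10-24 12:59:00"),
                      tz = "UTC")

# Count posts per one-hour bin; use "20 min" for 20-minute bins
hourly <- table(cut(created, breaks = "1 hour"))
print(hourly)

# The equivalent ggplot2 binwidth for one hour, in seconds
binwidth_seconds <- 60 * 60
```

The table has one entry per hour between the first and last post, including hours with no posts.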

This might be more informative if you want to look at specific time periods. We can look at the frequency of posts over the course of a specific day if we want.

#Keep only the posts created up to the cutoff time
data2 <- data[data$created <= as.POSIXct("2016-10-25 00:31:00"), ]
minutes <- 60
plot2 <- ggplot(data2, aes(created)) + geom_freqpoly(binwidth=60*minutes)
plot2

Topic modeling

Now we will look at how we can use topic modeling for our data. I will be using my larger Instagram corpus of about 7,000 posts available here.

#Packages used
packs = c("topicmodels","slam","Rmpfr","tm","stringr","ggplot2",
          "ggmap","wordcloud","plyr","DataCombine","tidytext","dplyr")
lapply(packs, library, character.only=T)
#Load in data
data=read.csv("Col_Sep_INSTACORPUS.csv", header=T)

Preparing data for Topic Modeling

The data need to be processed a bit more in order to analyze them. First I remove URLs:

#Remove URLs: everything from the scheme up to the next whitespace
data$text = gsub("(f|ht)tps?://[^[:space:]]+", "", data$text)
#Also get rid of ampersand coding
data$text = gsub("&amp;","", data$text)
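It is worth testing any URL pattern on a sample post before running it on the whole corpus. The sketch below compares two patterns on a made-up post: one that uses a greedy (.*) and one that stops at the first whitespace after the URL.

```r
# Made-up post with a URL followed by more text containing a dot
post <- "Coffee https://t.co/abc123 then drinks at elbo.room tonight"

# Greedy pattern: (.*) runs on to the last ".letters" in the post
greedy <- gsub("(f|ht)tp(s?)://(.*)[.][a-z]+", "", post)

# Stricter pattern: stop at the first whitespace after the URL
strict <- gsub("(f|ht)tps?://[^[:space:]]+", "", post)

print(greedy)
print(strict)
```

The greedy version also deletes the words between the URL and the last dot, which is easy to miss in a large corpus.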

I tell R that I want it to think of the text of the tweets as the corpus. I need to label the text as a corpus so I can use operations available from the tm package (Feinerer and Hornik 2015; Feinerer, Hornik and Meyer 2008).

#Tell R that your corpus is the text in the data frame 
#that you have just uploaded
corpus <- Corpus(VectorSource(data$text))
IC=corpus

I use the tm package to clean the data further:

IC = tm_map(IC, stripWhitespace)
IC = tm_map(IC, removePunctuation)
#Wrap base functions such as tolower in content_transformer() so the
#result stays a tm corpus (this replaces the PlainTextDocument step)
IC = tm_map(IC, content_transformer(tolower))
IC = tm_map(IC, removeNumbers)
#stopwords("english")
IC = tm_map(IC, removeWords, stopwords("english"))
#IC = tm_map(IC, stemDocument, "english")
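To see roughly what these cleaning steps do to a single post, here is a base R sketch. It only approximates the tm functions, and both the example post and the two-word stop word list are made up for illustration (tm's English list is much longer).

```r
# Approximate the cleaning steps on a plain character vector
clean_text <- function(x, stop_words) {
  x <- tolower(x)                      # lower-case everything
  x <- gsub("[[:punct:]]+", "", x)     # drop punctuation
  x <- gsub("[[:digit:]]+", "", x)     # drop numbers
  pattern <- paste0("\\b(", paste(stop_words, collapse = "|"), ")\\b")
  x <- gsub(pattern, "", x, perl = TRUE)   # drop whole stop words
  x <- gsub("\\s+", " ", x, perl = TRUE)   # collapse runs of whitespace
  trimws(x)
}

cleaned <- clean_text("The BEST coffee in the Mission since 2010!!",
                      stop_words = c("the", "in"))
print(cleaned)
```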

Analyzing the Data

My corpus is now ready for some linguistic analyses. First I use the TermDocumentMatrix function to create a term-document matrix. This is a matrix with one row per term in my corpus and one column per document, where each cell holds the term's frequency in that document.

tdm <- TermDocumentMatrix(IC, control = list(weighting = weightTf,
                                                tolower = FALSE))

Next, calculate the term frequency-inverse document frequency (tf-idf), which balances how frequent a term is within a document against how many documents it occurs in. As Grun and Hornik put it, "this measure allows to omit terms which have low frequency as well as those occurring in many documents" (Grun and Hornik 2011: 12).

tfidf <- weightTfIdf(tdm, normalize = TRUE)
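To make the weighting concrete, here is a minimal base R sketch of a tf-idf calculation on a tiny made-up term-document matrix. I am assuming the common formulation in which term frequency is normalized by document length and the inverse document frequency is a base-2 logarithm; treat the match with weightTfIdf() as an assumption rather than a guarantee.

```r
# Made-up counts: rows are terms, columns are documents
counts <- matrix(c(2, 0, 1,
                   1, 1, 1,
                   0, 3, 0),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(Terms = c("coffee", "mission", "tattoo"),
                                 Docs = c("d1", "d2", "d3")))

# Term frequency, normalized by the length of each document
tf <- sweep(counts, 2, colSums(counts), "/")

# Inverse document frequency: log2(number of docs / docs containing term)
idf <- log2(ncol(counts) / rowSums(counts > 0))

# Each row of tf is scaled by that term's idf
tfidf_manual <- tf * idf
print(round(tfidf_manual, 3))
```

A term such as "mission" that appears in every document gets a weight of zero, while a term concentrated in one document gets the highest weight.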

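You can use this matrix to explore your data. For example, I can look at terms whose total weight across all documents is at least 50. Because the matrix passed in is the tf-idf matrix, the threshold applies to summed tf-idf weights rather than raw counts. The result is alphabetical, not ordered by weight: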

findFreqTerms(tfidf, 50)

##  [1] "adoptvintagelove"               "alley"                         
##  [3] "amazing"                        "art"                           
##  [5] "back"                           "bakery"                        
##  [7] "bar"                            "bayarea"                       
##  [9] "bear"                           "beautiful"                     
## [11] "best"                           "birthday"                      
## [13] "brcccius"                       "brows"                         
## [15] "brucius"                        "cafe"                          
## [17] "california"                     "chapel"                        
## [19] "cinema"                         "city"                          
## [21] "clarion"                        "coffee"                        
## [23] "colone"                         "come"                          
## [25] "day"                            "district"                      
## [27] "dog"                            "dolores"                       
## [29] "dolorespark"                    "elbo"                          
## [31] "engraving"                      "etchingea"                     
## [33] "food"                           "foreign"                       
## [35] "francisco"                      "friday"                        
## [37] "friends"                        "fun"                           
## [39] "get"                            "good"                          
## [41] "got"                            "great"                         
## [43] "happy"                          "heavyblackheart"               
## [45] "jasonnevermind"                 "just"                          
## [47] "last"                           "lazy"                          
## [49] "life"                           "like"                          
## [51] "little"                         "love"                          
## [53] "manufactory"                    "mira"                          
## [55] "mission"                        "missionea"                     
## [57] "morning"                        "much"                          
## [59] "muttville"                      "muttvillesf"                   
## [61] "natural"                        "new"                           
## [63] "night"                          "now"                           
## [65] "one"                            "park"                          
## [67] "party"                          "photo"                         
## [69] "posted"                         "room"                          
## [71] "san"                            "sanfrancisco"                  
## [73] "science"                        "see"                           
## [75] "show"                           "smilingfacewithheartshapedeyes"
## [77] "sparkles"                       "street"                        
## [79] "streetart"                      "sunday"                        
## [81] "tartine"                        "tattoo"                        
## [83] "thanks"                         "theea"                         
## [85] "time"                           "today"                         
## [87] "tonight"                        "valencia"                      
## [89] "way"                            "weekend"

I can also make word clouds, starting from a table of overall term frequencies. The code is adapted from an online source.

dtm=t(tdm)
freq = data.frame(sort(colSums(as.matrix(dtm)), decreasing=TRUE))
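The call that draws the cloud is not shown above. Here is a minimal sketch of how it could look, assuming the wordcloud package is installed. The term counts are made up and only stand in for the sorted column sums of the document-term matrix.

```r
# Made-up term counts standing in for colSums(as.matrix(dtm))
term_counts <- sort(c(mission = 40, coffee = 25, tattoo = 10, park = 18),
                    decreasing = TRUE)

# Draw the cloud only if the wordcloud package is installed
if (requireNamespace("wordcloud", quietly = TRUE)) {
  wordcloud::wordcloud(words = names(term_counts), freq = term_counts,
                       min.freq = 1, random.order = FALSE)
}
```

With random.order = FALSE the most frequent terms are placed first, near the centre of the cloud.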

Topic Modeling

On to topic modeling! I'll be using Latent Dirichlet Allocation (LDA) with Gibbs sampling. There are different ways to build a topic model using different algorithms, but this is the one we'll try out today.

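So let's set up the parameters for the topic model:

#Set parameters for Gibbs sampling (the parameters are those used in
#Grun and Hornik 2011)
burnin <- 4000   #iterations discarded at the start of each run
iter <- 2000     #iterations run after the burn-in
thin <- 500      #iterations skipped between retained samples
seed <-list(2003,5,63,100001,765)   #one seed per run
nstart <- 5      #number of independent runs
best <- TRUE     #keep only the run with the highest posterior likelihood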

I set the number of topics to 10 (I found this number to work best in terms of 'making sense' – but this can change depending on how much data you have, how varied the data is, etc.)

k<-10
#For the model we are using the document term matrix (dtm) 
#NOT the term document matrix (tdm)
# #Run the model
# ldaOut <-LDA(dtm,k, method="Gibbs",
#            control=list(nstart=nstart, seed = seed, best=best,
#                         burnin = burnin, iter = iter, thin=thin))
# #Save the output
# save(ldaOut,file="ldaOut.RData")

Now we can start to explore the results. The topic model has gone through our corpus and estimated, for each topic, a probability for every term, and for each post, a probability for every topic. topics() returns the most likely topic for each post, and terms() returns the most probable terms for each topic. We can look at the top terms in each topic to get an idea of the most probable words and come up with a qualitative description.

#Load the output back in
load("ldaOut.RData")
#Look at results
ldaOut.topics <- as.matrix(topics(ldaOut))
topTenTermsEachTopic <- terms(ldaOut,10) 
#print(topTenTermsEachTopic)
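To make these two summaries less of a black box, here is a toy base R sketch of what I understand topics() and terms() to report. The two probability matrices are made up. In a fitted model they would come from the LDA output.

```r
# Made-up document-topic probabilities (rows: posts, columns: topics)
gamma <- matrix(c(0.7, 0.2, 0.1,
                  0.1, 0.3, 0.6),
                nrow = 2, byrow = TRUE)

# Made-up topic-term probabilities (rows: topics, columns: terms)
beta <- matrix(c(0.5, 0.3, 0.2,
                 0.1, 0.2, 0.7,
                 0.3, 0.6, 0.1),
               nrow = 3, byrow = TRUE,
               dimnames = list(NULL, c("coffee", "park", "tattoo")))

# Most likely topic for each post
doc_topic <- apply(gamma, 1, which.max)

# Top two terms for each topic, ordered by probability (one column per topic)
top_terms <- apply(beta, 1, function(p) {
  colnames(beta)[order(p, decreasing = TRUE)][1:2]
})

print(doc_topic)
print(top_terms)
```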

Topic Model Results

I can now create a CSV (comma separated value) file to look at in Excel to get a better idea.

#Look at the top 50 terms in each topic
ldaOut.terms <- as.matrix(terms(ldaOut,50))
#Save as CSV file to look at a bit closer
write.csv(ldaOut.terms, file=paste0("LDAGibbs_", k, "_TopicstoTerms.csv"))

We can also look at the probabilities behind each topic assignment: for every post, the model estimates the probability of each of the k topics. What we want is to eventually use these probabilities in a regression. Does an association with one of these topics predict a variable of interest?

topicProbabilities <- as.data.frame(ldaOut@gamma)
write.csv(topicProbabilities,
          file=paste0("LDAGibbs_", k, "_TopicProbabilities.csv"))

Having looked at the top 50 most probable terms for each topic, I've come up with some labels to describe them. What I want to do now is link back those topics to the tweets that have been assigned them. This involves joining two data frames together, and then replacing the numbers that describe my topics with my own invented prose versions.

#Write out the topics to a data frame so you can work with them
test <- as.data.frame(ldaOut.topics)
a<-c('Evaluation', 'Food','Service/Product Promos', 'Activities', 'Outdoors',
'Art', 'Places','Nightlife','Leisure','Hip Spots')
b<-c(1,2,3,4,5,6,7,8,9,10)
namesdf<-data.frame("Name"=a,"Number"=b)
test$V1<-as.factor(test$V1)
newtopics <- FindReplace(data = test, Var = "V1", replaceData = namesdf,
                       from = "Number", to = "Name", exact = TRUE)

## Only exact matches will be replaced.
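As a hedged alternative to the find-and-replace above, base R can attach the labels directly when the topic numbers are turned into a factor. The short vector of topic numbers is made up and only stands in for test$V1.

```r
# Made-up topic numbers standing in for test$V1
topic_number <- c(2, 10, 1, 2)

# Labels in topic order, as defined above
topic_names <- c('Evaluation', 'Food', 'Service/Product Promos', 'Activities',
                 'Outdoors', 'Art', 'Places', 'Nightlife', 'Leisure',
                 'Hip Spots')

# Map each number to its label; 10 cannot be confused with 1 here
topic_label <- factor(topic_number, levels = 1:10, labels = topic_names)
print(as.character(topic_label))
```

Because the match is on whole values, there is no need for an exact-match option.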

#Merge topics with tweet corpus
data$ID <- 1:nrow(data)
newtopics$ID <- 1:nrow(newtopics)
topicdata <- merge(data,newtopics,by="ID")

#Merge topic probabilities with tweet corpus
topicProbabilities$ID <- 1:nrow(topicProbabilities)
newdata <- merge(topicdata, topicProbabilities,by="ID")
write.csv(newdata, file="Tweetswtopics.csv")
newdata=read.csv("Tweetswtopics.csv", header=T)
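One side effect of the second merge is worth spelling out, because the column name it produces is used in the rest of the code. Both data frames contain a column called V1 (the topic label in one, the probability of topic 1 in the other), so merge() renames them V1.x and V1.y. A minimal sketch with made-up data frames:

```r
# Two made-up data frames that both contain a column called V1
topicdata <- data.frame(ID = 1:2, V1 = c("Food", "Art"))
topicProbabilities <- data.frame(ID = 1:2, V1 = c(0.8, 0.1),
                                 V2 = c(0.2, 0.9))

# merge() adds .x and .y suffixes to the clashing column names
merged <- merge(topicdata, topicProbabilities, by = "ID")
print(names(merged))
```

So V1.x holds the topic label and V1.y holds the probability of the first topic.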

Visualizing Topic Model Results

You can now map your posts and see where assigned topics are happening!

newdata$longitude<-as.numeric(newdata$longitude)
newdata$latitude <- as.numeric(newdata$latitude)
lon<-newdata$longitude
lat<-newdata$latitude
newdata$V1.x <- factor(newdata$V1.x)
Topics<-newdata$V1.x
mapPoints <- ggmap(map) + geom_point(aes(x = lon, y = lat, 
                                         color=Topics), 
                                     data=newdata, alpha=0.5)

mapPoints

This can be kind of messy, so we can subset our data to just look at particular topics.

#Subset the data by all those posts NOT categorized as Promos, etc.
sub <- newdata[! newdata$V1.x %in% c("Activities",
               "Places","Service/Product Promos"),]

Look at the simplified map:

sub$longitude<-as.numeric(sub$longitude)
sub$latitude <- as.numeric(sub$latitude)
lon<-sub$longitude
lat<-sub$latitude
sub$V1.x <- as.factor(sub$V1.x)
Topics<- sub$V1.x

mapPoints <- ggmap(map) + geom_point(aes(x = lon, y = lat, 
              color=Topics), data=sub, alpha=0.5)

mapPoints

We can zoom into particular areas too to take a closer look at what is going on:

map2 <- get_map(location = 'Dolores Street and 19th Street, San Francisco, California',
                zoom = 17)
mapPoints2 <- ggmap(map2) + geom_point(aes(x = lon, y = lat, 
                                           color=Topics), 
                                     data=sub, alpha=0.5)

mapPoints2

map3 <- get_map(location = 'Capp Street and 24th Street, San Francisco, California',
                zoom = 17)
mapPoints3 <- ggmap(map3) + geom_point(aes(x = lon, y = lat, 
                                           color=Topics), 
                                     data=sub, alpha=0.5)

mapPoints3

map4 <- get_map(location = '595 Alabama St, San Francisco, CA 94110', zoom = 17)
mapPoints4 <- ggmap(map4) + geom_point(aes(x = lon, y = lat, 
                                           color=Topics), 
                                     data=sub, alpha=0.5)

mapPoints4

We can also see how topics occur over time:

#Create a data frame with number of tweets per time
d2 <- as.data.frame(table(newdata$created))
d2 <- d2[order(d2$Freq, decreasing=T), ]
names(d2) <- c("created","freq")
#Combine this with existing data frame
newdata2 <- merge(newdata,d2,by="created")
#Tell R that 'created' is not an integer or factor but a time.
newdata2$created <- as.POSIXct(newdata2$created, format="%m/%d/%Y %H:%M")
#60-minute bins: binwidth is in seconds, so use minutes <- 20 for 20-minute bins
minutes <- 60
Topics<-newdata2$V1.x

ggplot(newdata2, aes(created, color = Topics)) + 
  geom_freqpoly(binwidth=60*minutes)

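#The cutoff year reads 0016, presumably because two-digit years in the
#CSV were parsed with %Y above; parsing with %y would give 2016 instead
newdata3 <- newdata2[newdata2$created <= "0016-08-03 00:31:00", ]
minutes <- 60
Topics<-newdata3$V1.x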

ggplot(newdata3, aes(created, color = Topics)) + 
  geom_freqpoly(binwidth=60*minutes)

I can also look at the frequency of these topics over time in a more abstract sense, by treating the posts as happening in one day to see overall patterns.

newdata$created2 <- as.POSIXct(newdata$created, format="%m/%d/%Y %H:%M")
newdata$created3<-format(newdata$created2,'%H:%M:%S')
d3 <- as.data.frame(table(newdata$created3))
d3 <- d3[order(d3$Freq, decreasing=T), ]
names(d3) <- c("created3","freq3")
newdata <- merge(newdata,d3,by="created3")
newdata$created3 <- as.POSIXct(newdata$created3, format="%H:%M:%S")
minutes <- 60
Topics<-newdata$V1.x
overalltimes <- ggplot(newdata, aes(created3, color = Topics)) + 
  geom_freqpoly(binwidth=60*minutes)

print(overalltimes)
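The trick in the code above is to throw away the date and keep only the time of day. Here is a minimal base R sketch of that step on made-up timestamps. When a time is parsed without a date, R fills in the current date, which is what puts all the posts on the same day.

```r
# Made-up timestamps from three different days
stamps <- as.POSIXct(c("2016-08-01 09:15:00", "2016-08-02 09:45:00",
                       "2016-08-03 21:05:00"), tz = "UTC")

# Keep only the time of day, then parse it back into a date-time;
# the missing date is filled in with today's date
time_only <- format(stamps, "%H:%M:%S")
same_day <- as.POSIXct(time_only, format = "%H:%M:%S", tz = "UTC")

# Count posts per hour of the day, across all days
hour_of_day <- as.integer(format(stamps, "%H"))
print(table(hour_of_day))
```

Posts from different days that fall in the same hour now land in the same bin.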