November 3, 2016

Linguistic Landscapes

Linguistic Landscapes is a subfield of sociolinguistics which encompasses the study of "the visibility and salience of languages on public and commercial signs in a given territory or region" (Landry and Bourhis 1997).

Linguistic Landscapes on Social Media

  • Continuing efforts in LL to recognize ALL aspects of LL
  • Interest in LL interpretants or engagement
  • Geotagged social media as part of the semiotic landscape: "any (public) space with visible inscription made through deliberate human intervention and meaning making" (Jaworski and Thurlow, 2010)

Getting the data

With Twitter's search API you can search for tweets that have been 'geotagged': tweets that have longitude and latitude coordinates attached to them.

My research is based in the Mission District neighborhood of San Francisco, so I am using coordinates that will get me tweets from that area. I've chosen a radius of 1 km around this central coordinate, but you can expand this radius.

To collect tweets you first need to set up a 'developer account' at Twitter and create a new application. This allows you to access Twitter's API. Detailed instructions on how to do this are available in Nathan Danneman and Richard Heimann's book Social Media Mining with R (highly recommended) or via other sources on the web.
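Once you have your application's credentials, authenticating from R looks roughly like this (a sketch using the twitteR package; the four placeholder strings stand in for the keys and tokens Twitter generates for your application):

library(twitteR)
#Substitute the credentials from your own Twitter application
setup_twitter_oauth(consumer_key = "YOUR_CONSUMER_KEY",
                    consumer_secret = "YOUR_CONSUMER_SECRET",
                    access_token = "YOUR_ACCESS_TOKEN",
                    access_secret = "YOUR_ACCESS_SECRET")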

Getting the data

geo=searchTwitter('',n=100000, geocode='37.76,-122.42,1km',
                  retryOnRateLimit=1)
## Warning in doRppAPICall("search/tweets", n, params = params,
## retryOnRateLimit = retryOnRateLimit, : 100000 tweets were requested but the
## API can only return 2033

This could take some time depending on how fast your Internet connection is and how many tweets are available. I was optimistic with my request for 100,000 tweets, so I received a warning message telling me how many were actually available. You can play around with this number if you are lucky enough to have more than 100,000 tweets available.

Processing the data

Now you have a list of tweets. Lists are cumbersome to work with in R, so you convert this into a data frame:

geoDF<-twListToDF(geo)
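It can be worth glancing at which columns twListToDF produced, since later steps rely on text, statusSource, created, longitude and latitude:

#Inspect the columns of the new data frame
names(geoDF)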

Processing the data

Chances are there will be emojis in your Twitter data. You can 'transform' these emojis into prose using this code as well as a CSV file I've put together of what all of the emojis look like in R. (The idea for this comes from Jessica Peterka-Bonetta's work – she has a list of emojis as well, but it does not include the newest batch of emojis or the different skin tone options for human emojis.) If you use this emoji list for your own research, please make sure to acknowledge both myself and Jessica.

Processing the data

Load in the CSV file. You want to make sure it is located in the correct working directory so R can find it when you tell it to read it in.

emoticons <- read.csv("Decoded Emojis Col Sep.csv", header = T)

To transform the emojis, you first need to transform the tweet data into ASCII:

geoDF$text <- iconv(geoDF$text, from = "latin1", to = "ascii", 
                    sub = "byte")

Processing the data

To 'count' the emojis, you do a find-and-replace using the CSV file of 'Decoded Emojis' as a reference. Here I am using the DataCombine package. This identifies emojis in the tweets and replaces them with a prose version. I used the description that pops up when hovering over an emoji on the Apple emoji keyboard. Even if these descriptions are not identical across platforms, they provide enough information to identify the emoji in question if you are not sure which one was used in the post.

emojireplace <- FindReplace(data = geoDF, Var = "text", 
                            replaceData = emoticons,
                       from = "R_Encoding", to = "Name", 
                       exact = FALSE)

Processing the data

Now might be a good time to save this file. I save it in CSV format, named with the date on which I collected the posts.

write.csv(emojireplace,file=paste("ALL",Sys.Date(),".csv"))

Processing the data

Now you have a data frame which you can manipulate in various ways. For my research, I'm just interested in posts that originated on Instagram. (Why not just access them via Instagram's API, you ask? Long story short: they are very, very conservative about providing access for academic research.) I've found a workaround, which is filtering the mined tweets to those that have Instagram as their source:

data <- emojireplace[emojireplace$statusSource == 
        "<a href=\"http://instagram.com\" rel=\"nofollow\">Instagram</a>", ]

#Save this file
write.csv(data,file=paste("INSTA",Sys.Date(),".csv"))

A note about the data

Important: data collected this way are obviously not representative of all Instagram posts made in the Mission District, since we depend on people who cross-post to Twitter, who are most likely a minority of Mission District Instagrammers. But this points to something true of any data obtained via social media: they are never truly representative. Partly because individuals must be assumed to be selective about what they post (posting is an inherently subjective process), and partly because not everyone is active on social media.

Visualizing the data

Now let's play around with visualizing the data. I want to superimpose different aspects of the tweets I collected on a map. First I have to get a map, which I do using the ggmap package, which interacts with the Google Maps API. When you use this package, be sure to cite it, as it asks you to when you first load it into your library.
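Loading the package and pulling up the reference it asks you to use looks like this:

library(ggmap)
#Print the citation information for the package
citation("ggmap")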

Visualizing the data

I request a map of the Mission District, and then check to make sure the map is what I want (in terms of zoom, area covered, etc.).

map <- get_map(location = 'Capp St. and 20th, San Francisco,
               California', zoom = 15)

Visualizing the data

ggmap(map)

Visualizing the data

Looks good to me! Now let's start to visualize our Instagram-via-Twitter data. We can start by seeing where our posts are on a map.

#Tell R what we want to map
data$longitude<-as.numeric(data$longitude)
data$latitude<-as.numeric(data$latitude)
lon <- data$longitude
lat <- data$latitude

For now I just want to look at latitude and longitude, but it is possible to map other aspects as well - it just depends on what you'd like to look at.

Visualizing the data

Now we use ggplot to plot our Instagram data over our map:

mapPoints <- ggmap(map) + geom_point(aes(x = lon, y = lat), data=data, alpha=0.5)

Visualizing the data

Voila! You have a map of your data.

mapPoints
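As an example of mapping one of those 'other aspects' mentioned above, here is a sketch that colors the points by how often each post was retweeted (it assumes the retweetCount column from twListToDF survived the earlier processing steps):

#Color points by retweet count (hypothetical example of another aspect)
Retweets <- as.numeric(data$retweetCount)
mapPointsRT <- ggmap(map) + geom_point(aes(x = lon, y = lat, color = Retweets),
                                       data = data, alpha = 0.5)
mapPointsRT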

Visualizing the data

We can also look at WHEN the posts were generated. We can make a graph of post frequency over time. Graphs constructed with help from here, here, here, here, here, here (https://stat.ethz.ch/R-manual/R-devel/library/base/html/as.POSIXlt.html), here and here.

#Create a data frame with number of tweets per time
d <- as.data.frame(table(data$created))
d <- d[order(d$Freq, decreasing=T), ]
names(d) <- c("created","freq")
#Combine this with existing data frame
data <- merge(data,d,by="created")
#Tell R that 'created' is not an integer or factor but a time.
data$created <- as.POSIXct(data$created)

Visualizing the data

Now plot the number of tweets over time, binned into 60-minute intervals:

minutes <- 60
Freq<-data$freq
plot1 <- ggplot(data, aes(created)) + geom_freqpoly(binwidth=60*minutes)
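The plot object is stored but not yet drawn; printing it renders the frequency polygon (presumably the figure shown on the following slide):

plot1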

Visualizing the data

Visualizing the data

This might be more informative if you want to look at specific time periods. We can look at the frequency of posts over the course of a specific day if we want:

data2 <- data[data$created <= "0016-08-03 00:31:00", ]
minutes <- 60
Freq<-data2$freq
plot2 <- ggplot(data2, aes(created)) + geom_freqpoly(binwidth=60*minutes)

Visualizing the data

Topic modeling

Now we will look at how we can use topic modeling on our data. I will be using my larger Instagram corpus of about 7,000 posts, available here.

#Packages used
packs = c("topicmodels","slam","Rmpfr","tm","stringr","ggplot2","ggmap","wordcloud","plyr","DataCombine")
lapply(packs, library, character.only=T)
#Load in data
data=read.csv("Col_Sep_INSTACORPUS.csv", header=T)

Preparing data for Topic Modeling

The data need to be processed a bit more in order to analyze them. First I remove URLs:

data$text = gsub("(f|ht)tp(s?)://(.*)[.][a-z]+", "", data$text)
#Also get rid of ampersand coding
data$text = gsub("&amp;","", data$text)

Preparing Data for Topic Modeling

I tell R to treat the text of the tweets as the corpus, so I can use the operations available in the tm package (Feinerer and Hornik 2015; Feinerer, Hornik and Meyer 2008).

#Tell R that your corpus is the text in the data frame that you have just uploaded
corpus <- Corpus(VectorSource(data$text))
IC=corpus

Preparing Data for Topic Modeling

I use the tm package to clean the data further:

#Get rid of whitespace
IC = tm_map(IC, stripWhitespace)
#Get rid of punctuation
IC = tm_map(IC, removePunctuation)
#Make everything lowercase (i.e. don't want to represent Good and good as two different words)
IC = tm_map(IC, tolower)
#Turn it into a plain text document which we can work with 
IC = tm_map(IC, PlainTextDocument)
#Get rid of numbers
IC = tm_map(IC, removeNumbers)
#Get rid of stopwords. You can inspect the stopword list for a given language with the command below -- you will get an error if no stopword list exists for that language. You could also build your own list.
#stopwords("english")
IC = tm_map(IC, removeWords, stopwords("english"))
#You might want to skip this last step if you are interested in seeing entire words -- helpful if you want to generate a word cloud with full words, etc.
#IC = tm_map(IC, stemDocument, "english")
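To check that the cleaning steps behaved as expected, you can peek at a few of the processed documents:

#Quick check of the first few cleaned documents
inspect(IC[1:3])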

Analyzing the Data

My corpus is now ready for some linguistic analyses. First I use the TermDocumentMatrix function to create a term-document matrix: a matrix of all the terms present in my corpus and their frequencies in each document (tweet).

tdm <- TermDocumentMatrix(IC, control = list(weighting = weightTf,
                                                      tolower = FALSE))

Analyzing the Data

This is not the most informative representation of the data, however, as some terms might be very frequent overall but occur in only one document (tweet). To account for this, you calculate the term frequency-inverse document frequency (tf-idf), which weights a term's frequency against how many documents it occurs in. As Grun and Hornik (2011: 12) put it, "this measure allows to omit terms which have low frequency as well as those occurring in many documents."

tfidf <- weightTfIdf(tdm, normalize = TRUE)
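The weighted matrix isn't printed here, but one quick way to peek at it is to look at the terms with the highest average tf-idf weight (a sketch; row_means comes from the slam package loaded above):

#Terms with the highest average tf-idf weight across documents
tfidf_means <- row_means(tfidf)
head(sort(tfidf_means, decreasing = TRUE), 10)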

Analyzing the Data

You can use these matrices to explore your data. For example, I can look at terms which have occurred at least 50 times (the result is alphabetical, not ordered by frequency):

Analyzing the data
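The list below was presumably generated with tm's findFreqTerms function; a sketch of the call, using 50 as the frequency threshold mentioned above:

#Terms occurring at least 50 times in the corpus (returned alphabetically)
findFreqTerms(tdm, lowfreq = 50)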

##  [1] "adoptvintagelove"               "alley"                         
##  [3] "amazing"                        "art"                           
##  [5] "back"                           "bakery"                        
##  [7] "bar"                            "bayarea"                       
##  [9] "bear"                           "beautiful"                     
## [11] "best"                           "birthday"                      
## [13] "brcccius"                       "brows"                         
## [15] "brucius"                        "cafe"                          
## [17] "california"                     "chapel"                        
## [19] "cinema"                         "city"                          
## [21] "clarion"                        "coffee"                        
## [23] "colone"                         "come"                          
## [25] "day"                            "district"                      
## [27] "dog"                            "dolores"                       
## [29] "dolorespark"                    "elbo"                          
## [31] "engraving"                      "etchingea"                     
## [33] "food"                           "foreign"                       
## [35] "francisco"                      "friday"                        
## [37] "friends"                        "fun"                           
## [39] "get"                            "good"                          
## [41] "got"                            "great"                         
## [43] "happy"                          "heavyblackheart"               
## [45] "jasonnevermind"                 "just"                          
## [47] "last"                           "lazy"                          
## [49] "life"                           "like"                          
## [51] "little"                         "love"                          
## [53] "manufactory"                    "mira"                          
## [55] "mission"                        "missionea"                     
## [57] "morning"                        "much"                          
## [59] "muttville"                      "muttvillesf"                   
## [61] "natural"                        "new"                           
## [63] "night"                          "now"                           
## [65] "one"                            "park"                          
## [67] "party"                          "photo"                         
## [69] "posted"                         "room"                          
## [71] "san"                            "sanfrancisco"                  
## [73] "science"                        "see"                           
## [75] "show"                           "smilingfacewithheartshapedeyes"
## [77] "sparkles"                       "street"                        
## [79] "streetart"                      "sunday"                        
## [81] "tartine"                        "tattoo"                        
## [83] "thanks"                         "theea"                         
## [85] "time"                           "today"                         
## [87] "tonight"                        "valencia"                      
## [89] "way"                            "weekend"

Analyzing the Data

Or I can look at the 'frequency of occurrence of each word in the corpus' (code adapted from Eight 2 late). For this we use a document-term matrix, where rows are documents and columns are terms. This is the transpose of the term-document matrix, in which rows are terms and columns are documents. Since we are summing columns to find the most frequent terms, we want a matrix in which the terms are the columns, with each row in a column counting how many times that term occurred in a tweet.

#Get a document term matrix
dtm=t(tdm)
freq <- colSums(as.matrix(dtm))
#Length should be total number of terms
length(freq)
## [1] 22852
#create sort order (descending)
ord <- order(freq,decreasing=TRUE)
#inspect most frequently occurring terms
freq[head(ord)]
##      mission         park          san      dolores    francisco 
##         1115          788          770          762          697 
## sanfrancisco 
##          682
#inspect least frequently occurring terms
freq[tail(ord)]  
##      zzjgu  zzlnjvvfu zznulsykus   zztfkjyo  zzwpxnnpl     zzzojw 
##          1          1          1          1          1          1

Analyzing the Data

My least frequently occurring terms are not that informative (a.k.a. they are gibberish!). If I want to restrict my corpus to terms that occur in between X and Y documents (tweets here -- meaning I can weed out some of the gibberish by keeping only terms that appear in at least 5 tweets), I can do the following, also adapted from Eight 2 late:

#Check out those terms that occur in 5 to 300 tweets
dtmr <-DocumentTermMatrix(IC, control=list(bounds = list(global = c(5,300))))

#See what the most and least frequent terms are under these parameters
freqr <- colSums(as.matrix(dtmr))
#length should be total number of terms
length(freqr)
## [1] 1957
#create sort order (descending)
ordr <- order(freqr,decreasing=TRUE)
#inspect most frequently occurring terms
freqr[head(ordr)]
##  photo    day    new posted street  alley 
##    267    254    245    241    238    234
#inspect least frequently occurring terms
freqr[tail(ordr)]
##             website whiteheavycheckmark              within 
##                   5                   5                   5 
##                wont               wrong         yellowheart 
##                   5                   5                   5

#See terms that occur at least 80 times in the corpus
findFreqTerms(dtmr,lowfreq=80)
##  [1] "adoptvintagelove"               "alley"                         
##  [3] "amazing"                        "art"                           
##  [5] "back"                           "bar"                           
##  [7] "bayarea"                        "beautiful"                     
##  [9] "best"                           "birthday"                      
## [11] "brcccius"                       "brows"                         
## [13] "brucius"                        "chapel"                        
## [15] "city"                           "clarion"                       
## [17] "coffee"                         "colone"                        
## [19] "come"                           "day"                           
## [21] "dog"                            "dolorespark"                   
## [23] "elbo"                           "engraving"                     
## [25] "etchingea"                      "fun"                           
## [27] "get"                            "good"                          
## [29] "got"                            "great"                         
## [31] "happy"                          "heavyblackheart"               
## [33] "jasonnevermind"                 "last"                          
## [35] "like"                           "little"                        
## [37] "love"                           "manufactory"                   
## [39] "morning"                        "muttvillesf"                   
## [41] "natural"                        "new"                           
## [43] "night"                          "now"                           
## [45] "one"                            "party"                         
## [47] "photo"                          "posted"                        
## [49] "room"                           "science"                       
## [51] "see"                            "show"                          
## [53] "smilingfacewithheartshapedeyes" "sparkles"                      
## [55] "street"                         "streetart"                     
## [57] "sunday"                         "tartine"                       
## [59] "tattoo"                         "thanks"                        
## [61] "theea"                          "time"                          
## [63] "today"                          "tonight"                       
## [65] "valencia"                       "weekend"

Analyzing the Data

I can also make word clouds. Code from here.

findFreqTerms(dtm, 100)
##  [1] "alley"           "art"             "back"           
##  [4] "bayarea"         "brucius"         "city"           
##  [7] "clarion"         "coffee"          "come"           
## [10] "day"             "district"        "dog"            
## [13] "dolores"         "dolorespark"     "francisco"      
## [16] "get"             "good"            "great"          
## [19] "happy"           "heavyblackheart" "just"           
## [22] "last"            "like"            "love"           
## [25] "manufactory"     "mission"         "morning"        
## [28] "muttvillesf"     "new"             "night"          
## [31] "now"             "one"             "park"           
## [34] "photo"           "posted"          "room"           
## [37] "san"             "sanfrancisco"    "sparkles"       
## [40] "street"          "streetart"       "tartine"        
## [43] "tattoo"          "theea"           "time"           
## [46] "today"           "tonight"         "valencia"
freq = data.frame(sort(colSums(as.matrix(dtm)), decreasing=TRUE))
wordcloud(rownames(freq), freq[,1], max.words=50, colors=brewer.pal(1, "Dark2"))

Topic Modeling

This is not very informative about the entire corpus, however. I am much more interested in seeing the different TOPICS people are tweeting about. To look at this, I must perform topic modeling.

Due to the nature of the distribution (the Dirichlet distribution) used in our algorithm (Latent Dirichlet Allocation), we must select the number of topics we want to model beforehand. This has to do with the fact that the Dirichlet distribution describes probabilities over multinomials -- or, as Savov (2014) puts it, "the addition of the Dirichlet distribution to the model allows us to specify our prior beliefs about what data X is likely to occur." In other words, this distribution is used to model probabilities, which must be fractions of some whole. If it doesn't know what the whole is, it can't really model how big those fractions are!

Topic Modeling

Anyway, there are two ways to determine the number of topics you want to use. The first is to just pick a number and play around from there to see what produces topics that make sense. The second is more technical and involves harmonic means. I will not discuss it here as it takes some time, but if you are interested in trying out this method, check out Graham and Ackland 2015.

Topic Modeling

On to topic modeling! As mentioned, I'll be using Latent Dirichlet Allocation (LDA) with Gibbs sampling. There are different ways to topic model using different algorithms, but this is the one we'll try out today.

What do the parameters mean?

First we have to set the parameters for Gibbs sampling. A nice explanation of Gibbs sampling and of what each of these parameters means / does is provided by Kailash Awati: "Gibbs Sampling works by performing a random walk in such a way that reflects the characteristics of a desired distribution. Because the starting point of the walk is chosen at random, it is necessary to discard the first few steps of the walk (as these do not correctly reflect the properties of distribution). This is referred to as the burn-in period. We set the burn-in parameter to 4000. Following the burn-in period, we perform 2000 iterations, taking every 500th iteration for further use. The reason we do this is to avoid correlations between samples. We use 5 different sample points (nstart = 5) that is, five independent runs. Each starting point requires a seed integer (this also ensures reproducibility) so I have provided five random integers in my seed list. Finally I have set best to TRUE (actually a default setting) which instructs the algorithm to return results of the run with the highest posterior probability… these settings do not guarantee the convergence of the algorithm to a globally optimal solution. Indeed, Gibbs sampling will, at best, find only a locally optimal solution, and even this is hard to prove mathematically in specific practical problems such as the one we are dealing with here. The upshot of this is that it is best to do lots of runs with different settings of parameters to check the stability of your results. The bottom line is that our interest is purely practical, so it is good enough if the results make sense".

Topic Modeling

So let's set up the parameters for the topic model

#Set parameters for Gibbs sampling (parameters those used in Grun and Hornik 2011)
#'burnin' refers to the burn-in period, discarding those first few steps of the 'walk'  
burnin <- 4000
#'iter' refers to the number of iterations; 'thin' means we keep every 500th iteration for further use
iter <- 2000
thin <- 500
#'seed' a list of 5 random integers
seed <-list(2003,5,63,100001,765)
#'nstart' is the number of independent runs / sample points
nstart <- 5
#'best' set to TRUE instructs the algorithm to return the results of the run with the highest posterior probability
best <- TRUE

Topic Modeling

I set the number of topics to 10 (I found this number to work best in terms of 'making sense' – but this can change depending on how much data you have, how varied the data is, etc.)

k<-10
#For the model we are using the document term matrix (dtm) NOT the term document matrix (tdm)
#Run the model
# ldaOut <-LDA(dtm,k, method="Gibbs", 
#              control=list(nstart=nstart, seed = seed, best=best, 
#                           burnin = burnin, iter = iter, thin=thin))
#Save the output
#save(ldaOut,file="ldaOut.RData")

Topic Modeling

Now we can start to explore the results. The topic model has gone through our corpus and assigned each tweet to its most likely topic. It has also estimated, for each topic, the probability of each term occurring in that topic. We can look at the top terms in each topic to get an idea of its most probable words and come up with a qualitative description.

#Load the output back in
load("ldaOut.RData")
#Look at results
ldaOut.topics <- as.matrix(topics(ldaOut))
topTenTermsEachTopic <- terms(ldaOut,10) 

Topic Model Results

##       Topic 1          Topic 2   Topic 3            Topic 4    
##  [1,] "street"         "just"    "sanfrancisco"     "new"      
##  [2,] "now"            "photo"   "bayarea"          "theea"    
##  [3,] "back"           "posted"  "muttvillesf"      "thanks"   
##  [4,] "valencia"       "best"    "dog"              "see"      
##  [5,] "jasonnevermind" "got"     "tattoo"           "work"     
##  [6,] "friends"        "first"   "brucius"          "old"      
##  [7,] "great"          "cafe"    "natural"          "chocolate"
##  [8,] "people"         "kitchen" "science"          "store"    
##  [9,] "like"           "ever"    "adoptvintagelove" "week"     
## [10,] "always"         "cream"   "engraving"        "thrift"   
##       Topic 5       Topic 6           Topic 7      Topic 8  
##  [1,] "park"        "alley"           "mission"    "night"  
##  [2,] "dolores"     "clarion"         "san"        "tonight"
##  [3,] "day"         "love"            "francisco"  "last"   
##  [4,] "today"       "art"             "district"   "come"   
##  [5,] "dolorespark" "streetart"       "sanea"      "get"    
##  [6,] "sparkles"    "heavyblackheart" "drafthouse" "chapel" 
##  [7,] "brows"       "mural"           "alamo"      "little" 
##  [8,] "beautiful"   "amazing"         "fran"       "every"  
##  [9,] "fun"         "life"            "heart"      "show"   
## [10,] "city"        "area"            "days"       "going"  
##       Topic 9              Topic 10                        
##  [1,] "one"                "tartine"                       
##  [2,] "happy"              "coffee"                        
##  [3,] "time"               "good"                          
##  [4,] "birthday"           "room"                          
##  [5,] "sunday"             "manufactory"                   
##  [6,] "friday"             "morning"                       
##  [7,] "mira"               "elbo"                          
##  [8,] "facewithtearsofjoy" "party"                         
##  [9,] "bear"               "colone"                        
## [10,] "lazy"               "smilingfacewithheartshapedeyes"

Topic Model Results

I can now create a CSV (comma separated value) file to look at in Excel and apply some labels to my topics.

#Check the top 50 terms in each topic
ldaOut.terms <- as.matrix(terms(ldaOut,50))
#Save as CSV file to look at a bit closer
write.csv(ldaOut.terms,file=paste("LDAGibbs",k,"TopicstoTerms.csv"))

Topic Model Results

We can also get a look at the probabilities associated with each topic assignment. Eventually we want to use these probabilities in a regression: does an association with one of these topics predict a variable of interest?

topicProbabilities <- as.data.frame(ldaOut@gamma)
write.csv(topicProbabilities,file=paste("LDAGibbs",k,"TopicProbabilities.csv"))
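As a purely hypothetical sketch of such a regression: suppose we take retweetCount (a column twListToDF provides, assuming it survived into the corpus CSV) as the outcome and the probability of Topic 1 (column V1 of topicProbabilities) as the predictor. The rows of topicProbabilities are in the same order as the rows of data, since the document-term matrix was built directly from data$text:

#Hypothetical: does the probability of Topic 1 predict retweets?
regdata <- cbind(data, topicProbabilities)
summary(lm(retweetCount ~ V1, data = regdata))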

Topic Model Results

Having looked at the top 50 most probable terms for each topic, I've come up with some labels to describe them. What I want to do now is link those topics back to the tweets that have been assigned to them. This involves joining two data frames together, and then replacing the numbers that identify my topics with my own invented prose labels.

#Write out the topics to a data frame so you can work with them
test <- as.data.frame(ldaOut.topics)
a<-c('Evaluation', 'Food','Service/Product Promos', 'Activities', 'Outdoors',
'Art', 'Places','Nightlife','Leisure','Hip Spots')
b<-c(1,2,3,4,5,6,7,8,9,10)
namesdf<-data.frame("Name"=a,"Number"=b)
test$V1<-as.factor(test$V1)
newtopics <- FindReplace(data = test, Var = "V1", replaceData = namesdf,
                       from = "Number", to = "Name", exact = TRUE)
## Only exact matches will be replaced.

#Merge topics with tweet corpus
data$ID <- 1:nrow(data)
newtopics$ID <- 1:nrow(newtopics)
topicdata <- merge(data,newtopics,by="ID")

#Merge topic probabilities with tweet corpus
topicProbabilities$ID <- 1:nrow(topicProbabilities)
newdata <- merge(topicdata, topicProbabilities,by="ID")
write.csv(newdata,file=paste("Tweetswtopics.csv"))

Visualizing Topic Model Results

You can now map your posts and see where assigned topics are happening!

newdata$longitude<-as.numeric(newdata$longitude)
newdata$latitude <- as.numeric(newdata$latitude)
lon<-newdata$longitude
lat<-newdata$latitude
newdata$V1.x <- factor(newdata$V1.x)
Topics<-newdata$V1.x
mapPoints <- ggmap(map) + geom_point(aes(x = lon, y = lat, color=Topics), 
                                     data=newdata, alpha=0.5)

Visualizing Topic Model Results

mapPoints

Visualizing Topic Model Results

This can be kind of messy, so we can subset our data to just look at particular topics.

#Subset the data by all those posts NOT categorized as Promos, etc.
sub <- newdata[! newdata$V1.x %in% c("Activities",
               "Places","Service/Product Promos"),]

Visualizing Topic Model Results

Look at the simplified map

sub$longitude<-as.numeric(sub$longitude)
sub$latitude <- as.numeric(sub$latitude)
lon<-sub$longitude
lat<-sub$latitude
sub$V1.x <- as.factor(sub$V1.x)
Topics<- sub$V1.x

mapPoints <- ggmap(map) + geom_point(aes(x = lon, y = lat, color=Topics), 
                                     data=sub, alpha=0.5)

Visualizing Topic Model Results

mapPoints

Visualizing Topic Model Results

We can zoom into particular areas too to take a closer look at what is going on:

map2 <- get_map(location = 'Dolores Street and 19th Street, San Francisco, California', zoom = 17)
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=Dolores+Street+and+19th+Street,+San+Francisco,+California&zoom=17&size=640x640&scale=2&maptype=terrain&language=en-EN&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Dolores%20Street%20and%2019th%20Street,%20San%20Francisco,%20California&sensor=false
ggmap(map2)

mapPoints2 <- ggmap(map2) + geom_point(aes(x = lon, y = lat, color=Topics), 
                                     data=sub, alpha=0.5)

Visualizing Topic Model Results

mapPoints2

Visualizing Topic Model Results

map3 <- get_map(location = 'Capp Street and 24th Street, San Francisco, California', zoom = 17)
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=Capp+Street+and+24th+Street,+San+Francisco,+California&zoom=17&size=640x640&scale=2&maptype=terrain&language=en-EN&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Capp%20Street%20and%2024th%20Street,%20San%20Francisco,%20California&sensor=false
ggmap(map3)

mapPoints3 <- ggmap(map3) + geom_point(aes(x = lon, y = lat, color=Topics), 
                                     data=sub, alpha=0.5)

Visualizing Topic Model Results

mapPoints3

Visualizing Topic Model Results

We can also see how topics occur over time

#Create a data frame with number of tweets per time
d2 <- as.data.frame(table(newdata$created))
d2 <- d2[order(d2$Freq, decreasing=T), ]
names(d2) <- c("created","freq")
#Combine this with existing data frame
newdata2 <- merge(newdata,d2,by="created")
#Tell R that 'created' is not an integer or factor but a time.
newdata2$created <- as.POSIXct(newdata2$created, format="%m/%d/%Y %H:%M")
#Bin into 60-minute intervals
minutes <- 60
Topics<-newdata2$V1.x

ggplot(newdata2, aes(created, color = Topics)) + 
  geom_freqpoly(binwidth=60*minutes)

Visualizing Topic Model Results

newdata3 <- newdata2[newdata2$created <= "0016-08-03 00:31:00", ]
minutes <- 60
Topics<-newdata3$V1.x
Freq<-newdata3$freq

ggplot(newdata3, aes(created, color = Topics)) + 
  geom_freqpoly(binwidth=60*minutes)

Visualizing Topic Model Results

I can also look at the frequency of these topics over time in a more abstract sense, by treating all the posts as if they happened on a single day, to see overall daily patterns.

newdata$created2 <- as.POSIXct(newdata$created, format="%m/%d/%Y %H:%M")
newdata$created3<-format(newdata$created2,'%H:%M:%S')
d3 <- as.data.frame(table(newdata$created3))
d3 <- d3[order(d3$Freq, decreasing=T), ]
names(d3) <- c("created3","freq3")
newdata <- merge(newdata,d3,by="created3")
newdata$created3 <- as.POSIXct(newdata$created3, format="%H:%M:%S")
minutes <- 60
Topics<-newdata$V1.x
overalltimes <- ggplot(newdata, aes(created3, color = Topics)) + 
  geom_freqpoly(binwidth=60*minutes)

print(overalltimes)