Linguistic Landscapes is a subfield of sociolinguistics which encompasses study of "the visibility and salience of languages on public and commercial signs in a given territory or region" (Landry and Bourhis 1997).
November 3, 2016
With Twitter's search API you can search for tweets that have been 'geotagged': tweets that have longitude and latitude coordinates attached to them.
My research is based in the Mission District neighborhood in San Francisco, so I am using coordinates that will get me tweets from that area. I've chosen a radius of 1 km from this central coordinate but you can expand this radius.
geo=searchTwitter('',n=100000, geocode='37.76,-122.42,1km', retryOnRateLimit=1)
## Warning in doRppAPICall("search/tweets", n, params = params,
## retryOnRateLimit = retryOnRateLimit, : 100000 tweets were requested but the
## API can only return 2028
This could take some time depending on the speed of your Internet connection and how many tweets are available. I was optimistic with my request for 100,000 tweets, so I receive a warning (not an error) that says how many were actually available: 2,028 in this case. You can play around with this number too if you are lucky enough to have more than 100,000 tweets available.
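If you want to try several centre points or radii, it can help to build the geocode string programmatically. The sketch below is only an illustration: `make_geocode` is a hypothetical helper of my own naming, not part of the twitteR package, and it simply pastes together the "latitude,longitude,radius" string that the call above passes by hand.

```r
# Hypothetical helper (not part of twitteR): build the "lat,lon,radius" string
# expected by the geocode argument used above.
make_geocode <- function(lat, lon, radius_km) {
  paste0(lat, ",", lon, ",", radius_km, "km")
}

mission <- make_geocode(37.76, -122.42, 1)
mission
```

The result can then be passed as `geocode = mission` in place of the hard-coded string.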
Now you have a list of tweets. Lists are very difficult to deal with in R, so you convert this into a data frame:
geoDF<-twListToDF(geo)
Chances are there will be emojis in your Twitter data. You can 'transform' these emojis into prose using this code as well as a CSV file I've put together of what all of the emojis look like in R. (The idea for this comes from Jessica Peterka-Bonetta's work – she has a list of emojis as well, but it does not include the newest batch of emojis nor the different skin color options for human-based emojis). If you use this emoji list for your own research, please make sure to acknowledge both myself and Jessica.
Load in the CSV file. You want to make sure it is located in the correct working directory so R can find it when you tell it to read it in.
emoticons <- read.csv("Decoded Emojis Col Sep.csv", header = T)
To transform the emojis, you first need to transform the tweet data into ASCII:
geoDF$text <- iconv(geoDF$text, from = "latin1", to = "ascii", sub = "byte")
To 'count' the emojis you do a find and replace using the CSV file of 'Decoded Emojis' as a reference. Here I am using the DataCombine package. This identifies emojis in the tweeted Instagram posts and replaces them with a prose version. I used whatever description pops up when hovering one's cursor over an emoji on an Apple emoji keyboard. Even if the description is not exactly the same on other platforms, it provides enough information to find the emoji in question if you are not sure which one was used in the post.
emojireplace <- FindReplace(data = geoDF, Var = "text", replaceData = emoticons, from = "R_Encoding", to = "Name", exact = FALSE)
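To make the find-and-replace step less of a black box, here is a rough base-R sketch of the same idea on toy data. The lookup table is a made-up two-row stand-in for the real CSV (the column names `R_Encoding` and `Name` follow the call above; the byte strings and names are illustrative assumptions), and `replace_emojis` is a hypothetical helper, not the DataCombine implementation.

```r
# Toy stand-in for the 'Decoded Emojis' CSV. The encodings mimic the <xx> byte
# escapes that iconv(..., sub = "byte") writes; treat them as illustrative.
lookup <- data.frame(
  R_Encoding = c("<f0><9f><98><8d>", "<e2><9c><a8>"),
  Name = c("SMILINGFACEWITHHEARTSHAPEDEYES", "SPARKLES"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: replace every encoding in the lookup with its name.
replace_emojis <- function(text, lookup) {
  for (i in seq_len(nrow(lookup))) {
    # fixed = TRUE treats the encoding as a literal string, not a regex
    text <- gsub(lookup$R_Encoding[i], lookup$Name[i], text, fixed = TRUE)
  }
  text
}

tweets <- c("love this place <f0><9f><98><8d>",
            "new mural <e2><9c><a8><e2><9c><a8>")
decoded <- replace_emojis(tweets, lookup)
decoded
```

Each emoji becomes a single searchable token, which is what lets emoji names show up later as terms in the corpus.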
Now might be a good time to save this file. I save it in a CSV format with the date of when I collected the posts.
write.csv(emojireplace,file=paste("ALL",Sys.Date(),".csv"))
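One small thing to be aware of: `paste()` separates its arguments with a space by default, so the file name written above contains spaces. If you would rather have a compact name, `paste0()` is an alternative. The sketch below uses a fixed date (an assumption, purely so the output is reproducible) instead of `Sys.Date()`.

```r
# Fixed date so the example is reproducible; in practice use Sys.Date()
stamp <- as.Date("2016-11-03")

with_spaces <- paste("ALL", stamp, ".csv")     # default separator is a space
no_spaces   <- paste0("ALL_", stamp, ".csv")   # paste0() adds no separator

with_spaces
no_spaces
```

Either name works with `write.csv()`; the second is just easier to handle on the command line.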
Now you have a data frame which you can manipulate in various ways. For my research, I'm just interested in posts that have occurred on Instagram. (Why not just access them via Instagram's API, you ask? Long story short: they are very conservative about providing access for academic research.) I've found a work-around, which is filtering mined tweets by those that have Instagram as a source:
data <- emojireplace[emojireplace$statusSource == "<a href=\"http://instagram.com\" rel=\"nofollow\">Instagram</a>", ]
#Save this file
write.csv(data,file=paste("INSTA",Sys.Date(),".csv"))
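The exact-match filter above depends on the source string being identical character for character. A slightly more forgiving variant, sketched here on a made-up data frame, matches any source that mentions instagram.com. The toy rows and the non-Instagram source string are assumptions for illustration only.

```r
# Toy data frame standing in for the tweet data (values are made up)
toy <- data.frame(
  text = c("brunch", "mural", "coffee"),
  statusSource = c(
    "<a href=\"http://instagram.com\" rel=\"nofollow\">Instagram</a>",
    "<a href=\"http://twitter.com\" rel=\"nofollow\">Twitter Web Client</a>",
    "<a href=\"http://instagram.com\" rel=\"nofollow\">Instagram</a>"
  ),
  stringsAsFactors = FALSE
)

# Keep rows whose source contains the literal string "instagram.com"
insta <- toy[grepl("instagram.com", toy$statusSource, fixed = TRUE), ]
insta$text
```

Whether a looser match is appropriate depends on your data; check `table(data$statusSource)` first to see which sources actually occur.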
Important: Data collected this way are not representative of all Instagram posts made in the Mission District, because we depend on people who cross-post to Twitter, and they are most likely a minority of Mission District Instagrammers. This points to a broader caveat about any data obtained via social media: it is never truly representative. Not everyone is active on social media, and those who are must be assumed to be selective about what they post, since posting is an inherently subjective process.
Now let's play around with visualizing the data. I want to superimpose different aspects of the tweets I collected on a map. First I have to get a map, which I do using the ggmap package, which interacts with the Google Maps API. When you use this package, be sure to cite it, as it asks you to when you first load the package into your library.
I request a map of the Mission District, and then check to make sure the map is what I want (in terms of zoom, area covered, etc.)
map <- get_map(location = 'Capp St. and 20th, San Francisco, California', zoom = 15)
ggmap(map)
Looks good to me! Now let's start to visualize our Instagram-via-Twitter data. We can start by seeing where our posts are on a map.
#Tell R what we want to map
data$longitude<-as.numeric(data$longitude)
data$latitude<-as.numeric(data$latitude)
lon <- data$longitude
lat <- data$latitude
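A caveat about `as.numeric()`: it does what you expect when the coordinates are stored as character strings, but if a column has ended up as a factor (which can happen when data are read back in from a CSV), it returns the internal level codes instead of the coordinates. The toy example below, with made-up values, shows the difference and the usual workaround.

```r
# Made-up longitudes stored as a factor
lon_factor <- factor(c("-122.42", "-122.41", "-122.42"))

codes  <- as.numeric(lon_factor)                 # level codes, not coordinates
coords <- as.numeric(as.character(lon_factor))   # the actual coordinates

codes
coords
```

Running `str(data)` before converting is a quick way to check which case applies.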
For now I just want to look at latitude and longitude, but it is possible to map other aspects as well - it just depends on what you'd like to look at.
Now we use ggplot to plot our Instagram data over our map:
mapPoints <- ggmap(map) + geom_point(aes(x = lon, y = lat), data=data, alpha=0.5)
mapPoints
We can also look at WHEN the posts were generated. We can make a graph of post frequency over time. Graphs constructed with help from here, here, here, here, here, here, here and here.
#Create a data frame with number of tweets per time
d <- as.data.frame(table(data$created))
d <- d[order(d$Freq, decreasing=T), ]
names(d) <- c("created","freq")
#Combine this with existing data frame
newdata1 <- merge(data,d,by="created")
#Tell R that 'created' is not an integer or factor but a time.
data$created <- as.POSIXct(data$created)
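Now plot the number of tweets over time. The binwidth is given in seconds, so with minutes set to 60 the code below uses 60-minute (one-hour) bins; set minutes to 20 if you want 20-minute intervals instead.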
minutes <- 60
Freq<-data$freq
plot1<-ggplot(data, aes(created)) + geom_freqpoly(binwidth=60*minutes)
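To see what the frequency polygon is counting, the same binning can be done by hand in base R. This is a minimal sketch on four made-up timestamps, not the data used in this post: `cut()` assigns each timestamp to the hour it falls in and `table()` counts the posts per hour, including hours with no posts.

```r
# Toy timestamps (made up), in UTC so the result does not depend on locale
times <- as.POSIXct(
  c("2016-10-24 10:05:00", "2016-10-24 10:45:00",
    "2016-10-24 11:10:00", "2016-10-24 13:30:00"),
  tz = "UTC"
)

# cut() assigns each timestamp to its hour; table() counts posts per bin
hourly <- table(cut(times, breaks = "hour"))
hourly
```

The counts per bin are the heights that `geom_freqpoly()` connects with a line when the binwidth is 3600 seconds.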
This might be more informative if you want to look at specific time periods. We can look at the frequency of posts over the course of a specific day if we want.
data2 <- data[data$created <= "2016-10-25 00:31:00", ]
minutes <- 60
Freq<-data2$freq
plot2<-ggplot(data2, aes(created)) + geom_freqpoly(binwidth=60*minutes)
Now we will look at how we can use topic modeling for our data. I will be using my larger Instagram corpus of about 7,000 posts available here.
#Packages used
packs = c("topicmodels","slam","Rmpfr","tm","stringr","ggplot2", "ggmap","wordcloud","plyr","DataCombine","tidytext","dplyr")
lapply(packs, library, character.only=T)
#Load in data
data=read.csv("Col_Sep_INSTACORPUS.csv", header=T)
The data need to be processed a bit more in order to analyze them. First I remove URLs:
data$text = gsub("(f|ht)tp(s?)://(.*)[.][a-z]+", "", data$text)
#Also get rid of ampersand coding
data$text = gsub("&","", data$text)
I tell R that I want it to think of the text of the tweets as the corpus. I need to label the text as a corpus so I can use operations available from the tm package (Feinerer and Hornik 2015; Feinerer, Hornik and Meyer 2008).
#Tell R that your corpus is the text in the data frame
#that you have just uploaded
corpus <- Corpus(VectorSource(data$text))
IC=corpus
I use the tm package to clean the data further:
IC = tm_map(IC, stripWhitespace)
IC = tm_map(IC, removePunctuation)
IC = tm_map(IC, tolower)
IC = tm_map(IC, PlainTextDocument)
IC = tm_map(IC, removeNumbers)
#stopwords("english")
IC = tm_map(IC, removeWords, stopwords("english"))
#IC = tm_map(IC, stemDocument, "english")
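If it is not obvious what these cleaning steps do to a post, here is a rough base-R approximation applied to a single made-up string. `clean_text` is a hypothetical helper and its four-word stopword list is a tiny stand-in for `stopwords("english")`; the tm functions will differ in detail.

```r
# Hypothetical helper approximating the tm cleaning steps in base R
clean_text <- function(x, stop_words = c("the", "a", "is", "at")) {
  x <- tolower(x)                              # lower-case everything
  x <- gsub("[[:punct:]]+", "", x)             # drop punctuation
  x <- gsub("[[:digit:]]+", "", x)             # drop numbers
  pattern <- paste0("\\b(", paste(stop_words, collapse = "|"), ")\\b")
  x <- gsub(pattern, "", x, perl = TRUE)       # drop whole-word stopwords
  x <- gsub("\\s+", " ", x)                    # collapse repeated whitespace
  trimws(x)
}

cleaned <- clean_text("Brunch at  Tartine is the BEST!!! 10/10")
cleaned
```

What survives is the kind of content-word sequence that goes into the term-document matrix.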
My corpus is now ready for some linguistic analyses. First I use the TermDocumentMatrix function to create a term-document matrix. This is a matrix of all the terms present in my corpus and their frequency.
tdm <- TermDocumentMatrix(IC, control = list(weighting = weightTf, tolower = FALSE))
Calculate the term frequency-inverse document frequency (tf-idf), which weighs how frequent a term is within a document against how many documents it occurs in. In the words of the topicmodels authors, "this measure allows to omit terms which have low frequency as well as those occurring in many documents" (Grun and Hornik 2011: 12).
tfidf <- weightTfIdf(tdm, normalize = TRUE)
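For intuition, tf-idf can be computed by hand on a tiny matrix. The sketch below assumes the common formulation that, as far as I understand it, `weightTfIdf` with `normalize = TRUE` follows: term frequency divided by document length, multiplied by log2 of (number of documents / number of documents containing the term). The counts are made up.

```r
# Toy term-document matrix: rows are terms, columns are documents (made up)
m <- matrix(c(2, 0, 1,
              1, 1, 1,
              0, 3, 0),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("coffee", "mission", "tattoo"),
                            c("d1", "d2", "d3")))

tf    <- sweep(m, 2, colSums(m), "/")       # term frequency, normalized per document
idf   <- log2(ncol(m) / rowSums(m > 0))     # rarer terms get a larger weight
tfidf_toy <- tf * idf                       # each row is scaled by its term's idf

round(tfidf_toy, 3)
```

Note that "mission", which appears in every toy document, gets a weight of zero everywhere, which is exactly the down-weighting of ubiquitous terms described in the quote above.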
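You can use this matrix to explore your data. For example, I can look at terms whose summed weight across the corpus is at least 50 (because the matrix is tf-idf weighted, this is a weighted score rather than a raw count; the result is alphabetical, not ordered by relative frequency):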
findFreqTerms(tfidf, 50)
## [1] "adoptvintagelove" "alley"
## [3] "amazing" "art"
## [5] "back" "bakery"
## [7] "bar" "bayarea"
## [9] "bear" "beautiful"
## [11] "best" "birthday"
## [13] "brcccius" "brows"
## [15] "brucius" "cafe"
## [17] "california" "chapel"
## [19] "cinema" "city"
## [21] "clarion" "coffee"
## [23] "colone" "come"
## [25] "day" "district"
## [27] "dog" "dolores"
## [29] "dolorespark" "elbo"
## [31] "engraving" "etchingea"
## [33] "food" "foreign"
## [35] "francisco" "friday"
## [37] "friends" "fun"
## [39] "get" "good"
## [41] "got" "great"
## [43] "happy" "heavyblackheart"
## [45] "jasonnevermind" "just"
## [47] "last" "lazy"
## [49] "life" "like"
## [51] "little" "love"
## [53] "manufactory" "mira"
## [55] "mission" "missionea"
## [57] "morning" "much"
## [59] "muttville" "muttvillesf"
## [61] "natural" "new"
## [63] "night" "now"
## [65] "one" "park"
## [67] "party" "photo"
## [69] "posted" "room"
## [71] "san" "sanfrancisco"
## [73] "science" "see"
## [75] "show" "smilingfacewithheartshapedeyes"
## [77] "sparkles" "street"
## [79] "streetart" "sunday"
## [81] "tartine" "tattoo"
## [83] "thanks" "theea"
## [85] "time" "today"
## [87] "tonight" "valencia"
## [89] "way" "weekend"
I can also make word clouds. Code from here.
dtm=t(tdm)
freq = data.frame(sort(colSums(as.matrix(dtm)), decreasing=TRUE))
On to topic modeling! I'll be using Latent Dirichlet Allocation (LDA) with Gibbs sampling. There are different ways to topic model using different algorithms, but this is the one we'll try out today.
So let's set up the parameters for the topic model
#Set parameters for Gibbs sampling (the same parameters as those used in
#Grun and Hornik 2011)
burnin <- 4000
iter <- 2000
thin <- 500
seed <-list(2003,5,63,100001,765)
nstart <- 5
best <- TRUE
I set the number of topics to 10 (I found this number to work best in terms of 'making sense' – but this can change depending on how much data you have, how varied the data is, etc.)
k<-10
#For the model we are using the document term matrix (dtm)
#NOT the term document matrix (tdm)
# #Run the model
# ldaOut <-LDA(dtm,k, method="Gibbs",
#   control=list(nstart=nstart, seed = seed, best=best,
#   burnin = burnin, iter = iter, thin=thin))
# #Save the output
# save(ldaOut,file="ldaOut.RData")
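It helps to have a feel for how much work these settings ask for, since the model can take a long time to run. The arithmetic below reflects my reading of the topicmodels control parameters (burn-in iterations are discarded, then `iter` further iterations are run, with every `thin`-th one retained, repeated for each of the `nstart` seeds); treat it as a back-of-the-envelope sketch rather than a description of the package internals.

```r
# Same values as set above, repeated so the sketch runs on its own
burnin <- 4000
iter   <- 2000
thin   <- 500
nstart <- 5

sweeps_per_run <- burnin + iter             # Gibbs iterations per random start
total_sweeps   <- nstart * sweeps_per_run   # across all random starts
kept_per_run   <- iter %/% thin             # iterations retained after thinning

c(sweeps_per_run = sweeps_per_run,
  total_sweeps   = total_sweeps,
  kept_per_run   = kept_per_run)
```

On a corpus of a few thousand posts this is why it is worth saving the fitted model to disk and loading it back in rather than refitting every time.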
Now we can start to explore the results. The topic model has gone through our corpus and assigned each term to a topic. Each term is also given a probability, which is the probability that it will occur in the assigned topic. We can look at the top terms in each topic to get an idea of the most probable words for each topic and come up with a qualitative description.
#Load the output back in
load("ldaOut.RData")
#Look at results
ldaOut.topics <- as.matrix(topics(ldaOut))
topTenTermsEachTopic <- terms(ldaOut,10)
#print(topTenTermsEachTopic)
I can now create a CSV (comma separated value) file to look at in Excel to get a better idea.
#Look at the top 50 terms in each topic
ldaOut.terms <- as.matrix(terms(ldaOut,50))
#Save as CSV file to look at a bit closer
write.csv(ldaOut.terms,file=paste("LDAGibbs",k,"TopicstoTerms.csv"))
We can also get a look at the probabilities associated with each topic assignment. What we want is to eventually use these probabilities in a regression. Does an association with one of these topics predict a variable of interest?
topicProbabilities <- as.data.frame(ldaOut@gamma)
write.csv(topicProbabilities, file=paste("LDAGibbs",k,"TopicProbabilities.csv"))
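The relationship between these probabilities and the single topic assigned to each post can be illustrated on a toy matrix: the assigned topic is simply the column with the highest probability in each row. The numbers below are made up, and the sketch uses three topics instead of ten to keep it readable.

```r
# Toy document-topic probabilities: one row per post, one column per topic
gamma_toy <- matrix(c(0.70, 0.20, 0.10,
                      0.10, 0.30, 0.60,
                      0.25, 0.50, 0.25),
                    nrow = 3, byrow = TRUE)

top_topic <- apply(gamma_toy, 1, which.max)   # most probable topic per post
top_prob  <- apply(gamma_toy, 1, max)         # how probable that topic is

data.frame(post = 1:3, top_topic, top_prob)
```

Looking at `top_prob` alongside the assignment is a quick way to spot posts where the model was not very sure, such as the third post here.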
Having looked at the top 50 most probable terms for each topic, I've come up with some labels to describe them. What I want to do now is link back those topics to the tweets that have been assigned them. This involves joining two data frames together, and then replacing the numbers that describe my topics with my own invented prose versions.
#Write out the topics to a data frame so you can work with them
test <- as.data.frame(ldaOut.topics)
a<-c('Evaluation', 'Food','Service/Product Promos', 'Activities', 'Outdoors', 'Art', 'Places','Nightlife','Leisure','Hip Spots')
b<-c(1,2,3,4,5,6,7,8,9,10)
namesdf<-data.frame("Name"=a,"Number"=b)
test$V1<-as.factor(test$V1)
newtopics <- FindReplace(data = test, Var = "V1", replaceData = namesdf, from = "Number", to = "Name", exact = TRUE)
## Only exact matches will be replaced.
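As an aside, the same relabelling can be done without a find-and-replace by using the topic number as an index into the vector of names. This is an alternative sketch, not what the code above does; the topic names are the ones used in this post, while the `assigned` vector is made up.

```r
# Topic labels from this post, in topic-number order
topic_names <- c('Evaluation', 'Food', 'Service/Product Promos', 'Activities',
                 'Outdoors', 'Art', 'Places', 'Nightlife', 'Leisure', 'Hip Spots')

# Made-up topic assignments for four posts
assigned <- c(2, 10, 6, 2)

# Index into the names, keeping all ten labels as factor levels
labelled <- factor(topic_names[assigned], levels = topic_names)
as.character(labelled)
```

Indexing avoids any risk of "1" being matched inside "10", which is why the find-and-replace above needs `exact = TRUE`.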
#Merge topics with tweet corpus
data$ID <- 1:nrow(data)
newtopics$ID <- 1:nrow(newtopics)
topicdata <- merge(data,newtopics,by="ID")
#Merge topic probabilities with tweet corpus
topicProbabilities$ID <- 1:nrow(topicProbabilities)
newdata <- merge(topicdata, topicProbabilities,by="ID")
write.csv(newdata,file=paste("Tweetswtopics.csv"))
newdata=read.csv("Tweetswtopics.csv", header=T)
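A side effect of these two merges is worth spelling out, because the column name `V1.x` appears in the code from here on. Both the topic labels and the topic probabilities have a column called `V1`, so `merge()` disambiguates them with the suffixes `.x` and `.y`. The toy data frames below (made-up values, two topics only) reproduce the effect.

```r
# Toy versions of the three data frames being merged (values are made up)
posts  <- data.frame(ID = 1:3, text = c("a", "b", "c"), stringsAsFactors = FALSE)
labels <- data.frame(ID = 1:3, V1 = c("Food", "Art", "Food"), stringsAsFactors = FALSE)
probs  <- data.frame(ID = 1:3, V1 = c(0.1, 0.2, 0.3), V2 = c(0.9, 0.8, 0.7))

step1 <- merge(posts, labels, by = "ID")   # adds the topic label as V1
step2 <- merge(step1, probs, by = "ID")    # V1 clashes, so merge() adds suffixes

names(step2)
```

So `V1.x` holds the topic label and `V1.y` holds the probability of topic 1. Renaming the label column before merging would make the later code easier to read.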
You can now map your posts and see where assigned topics are happening!
newdata$longitude<-as.numeric(newdata$longitude)
newdata$latitude <- as.numeric(newdata$latitude)
lon<-newdata$longitude
lat<-newdata$latitude
newdata$V1.x <- factor(newdata$V1.x)
Topics<-newdata$V1.x
mapPoints <- ggmap(map) + geom_point(aes(x = lon, y = lat, color=Topics), data=newdata, alpha=0.5)
mapPoints
This can be kind of messy, so we can subset our data to just look at particular topics.
#Subset the data by all those posts NOT categorized as Promos, etc.
sub <- newdata[! newdata$V1.x %in% c("Activities", "Places","Service/Product Promos"),]
Look at the simplified map
sub$longitude<-as.numeric(sub$longitude)
sub$latitude <- as.numeric(sub$latitude)
lon<-sub$longitude
lat<-sub$latitude
sub$V1.x <- as.factor(sub$V1.x)
Topics<- sub$V1.x
mapPoints <- ggmap(map) + geom_point(aes(x = lon, y = lat, color=Topics), data=sub, alpha=0.5)
mapPoints
We can zoom into particular areas too to take a closer look at what is going on:
map2 <- get_map(location = 'Dolores Street and 19th Street, San Francisco, California', zoom = 17)
mapPoints2 <- ggmap(map2) + geom_point(aes(x = lon, y = lat, color=Topics), data=sub, alpha=0.5)
mapPoints2
map3 <- get_map(location = 'Capp Street and 24th Street, San Francisco, California', zoom = 17)
mapPoints3 <- ggmap(map3) + geom_point(aes(x = lon, y = lat, color=Topics), data=sub, alpha=0.5)
mapPoints3
map4 <- get_map(location = '595 Alabama St, San Francisco, CA 94110', zoom = 17)
mapPoints4 <- ggmap(map4) + geom_point(aes(x = lon, y = lat, color=Topics), data=sub, alpha=0.5)
mapPoints4
We can also see how topics occur over time
#Create a data frame with number of tweets per time
d2 <- as.data.frame(table(newdata$created))
d2 <- d2[order(d2$Freq, decreasing=T), ]
names(d2) <- c("created","freq")
#Combine this with existing data frame
newdata2 <- merge(newdata,d2,by="created")
#Tell R that 'created' is not an integer or factor but a time.
newdata2$created <- as.POSIXct(newdata2$created, format="%m/%d/%Y %H:%M")
#60 minute intervals (binwidth below is in seconds, so 60*minutes is one hour)
minutes <- 60
Topics<-newdata2$V1.x
ggplot(newdata2, aes(created, color = Topics)) + geom_freqpoly(binwidth=60*minutes)
newdata3 <- newdata2[newdata2$created <= "0016-08-03 00:31:00", ]
minutes <- 60
Topics<-newdata3$V1.x
Freq<-newdata3$freq
ggplot(newdata3, aes(created, color = Topics)) + geom_freqpoly(binwidth=60*minutes)
I can also look at the frequency of these topics over time in a more abstract sense, by treating the posts as happening in one day to see overall patterns.
newdata$created2 <- as.POSIXct(newdata$created, format="%m/%d/%Y %H:%M")
newdata$created3<-format(newdata$created2,'%H:%M:%S')
d3 <- as.data.frame(table(newdata$created3))
d3 <- d3[order(d3$Freq, decreasing=T), ]
names(d3) <- c("created3","freq3")
newdata <- merge(newdata,d3,by="created3")
newdata$created3 <- as.POSIXct(newdata$created3, format="%H:%M:%S")
minutes <- 60
Topics<-newdata$V1.x
overalltimes <- ggplot(newdata, aes(created3, color = Topics)) + geom_freqpoly(binwidth=60*minutes)
print(overalltimes)
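The trick in this last plot is to throw away the date and keep only the time of day, so that posts from different days pile up on a single 24-hour axis. A stripped-down version of the same idea, on three made-up timestamps in the same month/day/year format as the corpus, looks like this:

```r
# Made-up timestamps in the same "%m/%d/%Y %H:%M" format as the corpus
created <- c("08/01/2016 09:15", "08/02/2016 09:40", "08/02/2016 18:05")

# Parse as date-times (UTC so the result does not depend on the local time zone)
parsed <- as.POSIXct(created, format = "%m/%d/%Y %H:%M", tz = "UTC")

# Keep only the hour, discarding the date, then count posts per hour of day
hour_of_day <- format(parsed, "%H")
per_hour <- table(hour_of_day)
per_hour
```

Two of the three toy posts fall in the 9 o'clock hour even though they were made on different days, which is the "overall pattern" the plot above is designed to show.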