Linguistic Landscapes is a subfield of sociolinguistics which encompasses study of "the visibility and salience of languages on public and commercial signs in a given territory or region" (Landry and Bourhis 1997).
November 3, 2016
With Twitter's search API you can search for tweets that have been 'geotagged': tweets that have longitude and latitude coordinates attached to them.
My research is based in the Mission District neighborhood in San Francisco, so I am using coordinates that will get me tweets from that area. I've chosen a radius of 1 km from this central coordinate but you can expand this radius.
To collect tweets you first need to set up a 'developer account' at Twitter and create a new application. This allows you to access Twitter's API. Detailed instructions on how to do this are available in Nathan Danneman and Richard Heimann's book Social Media Mining with R (highly recommended) or via other sources on the web.
geo=searchTwitter('',n=100000, geocode='37.76,-122.42,1km', retryOnRateLimit=1)
## Warning in doRppAPICall("search/tweets", n, params = params, ## retryOnRateLimit = retryOnRateLimit, : 100000 tweets were requested but the ## API can only return 2033
This could take some time depending on the speed of your Internet connection and how many tweets are available. I was optimistic with my request for 100,000 tweets, so I receive a warning that says how many were actually returned (2,033 here). You can play around with this number too if you are lucky enough to have more than 100,000 tweets available.
Now you have a list of tweets. Lists are very difficult to deal with in R, so you convert this into a data frame:
geoDF<-twListToDF(geo)
Chances are there will be emojis in your Twitter data. You can 'transform' these emojis into prose using this code as well as a CSV file I've put together of what all of the emojis look like in R. (The idea for this comes from Jessica Peterka-Bonetta's work – she has a list of emojis as well, but it does not include the newest batch of emojis nor the different skin color options for human-based emojis). If you use this emoji list for your own research, please make sure to acknowledge both myself and Jessica.
Load in the CSV file. You want to make sure it is located in the correct working directory so R can find it when you tell it to read it in.
emoticons <- read.csv("Decoded Emojis Col Sep.csv", header = T)
To transform the emojis, you first need to transform the tweet data into ASCII:
geoDF$text <- iconv(geoDF$text, from = "latin1", to = "ascii", sub = "byte")
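To see what this conversion does to an emoji, here is a minimal sketch on a single made-up string (not from the real data). It assumes the emoji arrives as UTF-8 bytes; each byte that has no ASCII equivalent should come back as a `<xx>` hex code, which is the form the lookup file is matched against.

```r
# Build the UTF-8 bytes of one emoji (U+1F600) explicitly, so the example
# does not depend on the encoding of this script file.
emoji <- rawToChar(as.raw(c(0xf0, 0x9f, 0x98, 0x80)))
tweet <- paste("I love SF", emoji)

# Same call as above: every non-ASCII byte is substituted with "<xx>".
ascii_tweet <- iconv(tweet, from = "latin1", to = "ascii", sub = "byte")
print(ascii_tweet)
```

The printed string should end in four bracketed byte codes, one for each byte of the emoji.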
To 'count' the emojis you do a find and replace using the CSV file of 'Decoded Emojis' as a reference. Here I am using the DataCombine package. This identifies emojis in the tweeted Instagram posts and replaces them with a prose version. I used whatever description pops up when hovering one's cursor over an emoji on an Apple emoji keyboard. Even if this is not exactly the same as on other platforms, it provides enough information to find the emoji in question if you are not sure which one was used in the post.
emojireplace <- FindReplace(data = geoDF, Var = "text", replaceData = emoticons, from = "R_Encoding", to = "Name", exact = FALSE)
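If you want to see the idea behind this step without DataCombine, here is a rough base R sketch of the same find and replace. The two-row lookup table and the example posts are hypothetical; only the column names `R_Encoding` and `Name` are taken from the call above, and the real CSV may spell the names differently.

```r
# Hypothetical lookup table; the real CSV has one row per emoji.
lookup <- data.frame(
  R_Encoding = c("<f0><9f><98><8d>", "<e2><9c><a8>"),
  Name = c("SMILINGFACEWITHHEARTSHAPEDEYES", "SPARKLES"),
  stringsAsFactors = FALSE
)

# Made-up posts that have already been converted to ASCII byte codes.
posts <- c("tartine <f0><9f><98><8d>",
           "dolores park <e2><9c><a8><e2><9c><a8>")

# Replace each byte pattern with its name, one lookup row at a time.
# fixed = TRUE treats the pattern as literal text, not a regular expression.
decoded <- posts
for (i in seq_len(nrow(lookup))) {
  decoded <- gsub(lookup$R_Encoding[i], lookup$Name[i], decoded, fixed = TRUE)
}
print(decoded)
```

Because the matching is not exact, a short byte pattern can also match inside a longer one, so the order of rows in the lookup table may matter for emojis that share a prefix (such as skin tone variants).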
Now might be a good time to save this file. I save it in a CSV format with the date of when I collected the posts.
write.csv(emojireplace,file=paste("ALL",Sys.Date(),".csv"))
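One small thing to be aware of: `paste()` separates its arguments with a space by default, so the file name above contains spaces. A quick sketch with a fixed date (chosen only so the result is reproducible) shows the difference, with `paste0()` as one possible alternative if you prefer names without spaces.

```r
# Fixed date for illustration; the tutorial uses Sys.Date() instead.
stamp <- as.Date("2016-11-03")

# Default separator is a single space.
with_spaces <- paste("ALL", stamp, ".csv")

# paste0() joins the pieces with no separator at all.
no_spaces <- paste0("ALL_", stamp, ".csv")

print(with_spaces)
print(no_spaces)
```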
Now you have a data frame which you can manipulate in various ways. For my research, I'm just interested in posts that have occurred on Instagram. (Why not just access them via Instagram's API, you ask? Long story short: they are very, very conservative about providing access for academic research.) I've found a workaround, which is filtering mined tweets by those that have Instagram as a source:
data <- emojireplace[emojireplace$statusSource == "<a href=\"http://instagram.com\" rel=\"nofollow\">Instagram</a>", ]
#Save this file
write.csv(data,file=paste("INSTA",Sys.Date(),".csv"))
Important: Obviously, data collected this way are not representative of all Instagram posts made in the Mission District, as we depend on people who cross-post to Twitter, who are most likely a minority of Mission District Instagrammers. This points to something important about any data obtained via social media: it's never truly representative. Not everyone is active on social media, and those who are must be assumed to be selective about what they post, as posting is an inherently subjective process.
Now let's play around with visualizing the data. I want to superimpose different aspects of the tweets I collected on a map. First I have to get a map, which I do using the ggmap package, which interacts with the Google Maps API. When you use this package, be sure to cite it, as it asks you to do when you first load it into your library.
I request a map of the Mission District, and then check to make sure the map is what I want (in terms of zoom, area covered, etc.)
map <- get_map(location = 'Capp St. and 20th, San Francisco, California', zoom = 15)
ggmap(map)
Looks good to me! Now let's start to visualize our Instagram-via-Twitter data. We can start by seeing where our posts are on a map.
#Tell R what we want to map
data$longitude <- as.numeric(data$longitude)
data$latitude <- as.numeric(data$latitude)
lon <- data$longitude
lat <- data$latitude
For now I just want to look at latitude and longitude, but it is possible to map other aspects as well - it just depends on what you'd like to look at.
Now we use ggplot to plot our Instagram data over our map:
mapPoints <- ggmap(map) + geom_point(aes(x = lon, y = lat), data=data, alpha=0.5)
Voila! You have a map of your data.
mapPoints
We can also look at WHEN the posts were generated. We can make a graph of post frequency over time. Graphs constructed with help from several online sources, including the R documentation for as.POSIXlt (https://stat.ethz.ch/R-manual/R-devel/library/base/html/as.POSIXlt.html).
#Create a data frame with number of tweets per time
d <- as.data.frame(table(data$created))
d <- d[order(d$Freq, decreasing=T), ]
names(d) <- c("created","freq")
#Combine this with existing data frame
data <- merge(data,d,by="created")
#Tell R that 'created' is not an integer or factor but a time.
data$created <- as.POSIXct(data$created)
Now plot the number of tweets over time in 60-minute intervals:
#binwidth is given in seconds, so 60*minutes is one hour
minutes <- 60
Freq <- data$freq
plot1 <- ggplot(data, aes(created)) + geom_freqpoly(binwidth=60*minutes)
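To make the binning concrete, here is a small base R sketch that counts made-up timestamps per hour, which is roughly what the frequency polygon does before drawing its line. The three timestamps are invented for illustration.

```r
# Three made-up post times (UTC is an arbitrary choice for the example).
created <- as.POSIXct(c("2016-08-02 09:05:00",
                        "2016-08-02 09:40:00",
                        "2016-08-02 11:15:00"), tz = "UTC")

# Assign each timestamp to the hour it falls in, then count per hour.
hour_bin <- cut(created, breaks = "hour")
counts <- table(hour_bin)
print(counts)
```

Two posts land in the 09:00 bin and one in the 11:00 bin. Changing `breaks` to something like `"20 min"` would give finer bins, just as a smaller `binwidth` does in the plot.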
This might be more informative if you want to look at specific time periods. We can look at the frequency of posts over the course of a specific day if we want:
data2 <- data[data$created <= "0016-08-03 00:31:00", ]
minutes <- 60
Freq <- data2$freq
plot2 <- ggplot(data2, aes(created)) + geom_freqpoly(binwidth=60*minutes)
Now we will look at how we can use topic modeling for our data. I will be using my larger Instagram corpus of about 7,000 posts available here
#Packages used
packs = c("topicmodels","slam","Rmpfr","tm","stringr","ggplot2","ggmap","wordcloud","plyr","DataCombine")
lapply(packs, library, character.only=T)
#Load in data
data=read.csv("Col_Sep_INSTACORPUS.csv", header=T)
The data need to be processed a bit more in order to analyze them. First I remove URLs:
data$text = gsub("(f|ht)tp(s?)://(.*)[.][a-z]+", "", data$text)
#Also get rid of ampersand coding
data$text = gsub("&","", data$text)
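It is worth checking what this URL pattern does to a shortened link. The sketch below runs it on one made-up post and compares it with an alternative pattern that removes everything up to the next whitespace. The alternative is offered as an assumption about what you might prefer, not as part of the original workflow.

```r
# A made-up post containing a shortened link (not from the real corpus).
post <- "coffee https://t.co/AbC123 with friends"

# The pattern used above stops after the last ".letters" chunk it can find,
# so the path of the short link is left behind in the text.
original <- gsub("(f|ht)tp(s?)://(.*)[.][a-z]+", "", post)

# Alternative (an assumption): remove from the scheme up to the next space.
alternative <- gsub("(f|ht)tps?://[^[:space:]]+", "", post)

print(original)
print(alternative)
```

With the first pattern the slug `AbC123` survives, and after lowercasing and stripping punctuation and numbers it would end up as a term in the corpus. This may account for some of the random-looking rare terms that show up later.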
I tell R that I want it to think of the text of the tweets as the corpus. I need to label the text as a corpus so I can use operations available from the tm package (Feinerer and Hornik 2015; Feinerer, Hornik and Meyer 2008).
#Tell R that your corpus is the text in the data frame that you have just uploaded
corpus <- Corpus(VectorSource(data$text))
IC=corpus
I use the tm package to clean the data further:
#Get rid of whitespace
IC = tm_map(IC, stripWhitespace)
#Get rid of punctuation
IC = tm_map(IC, removePunctuation)
#Make everything lowercase (i.e. don't want to represent Good and good as two different words)
IC = tm_map(IC, tolower)
#Turn it into a plain text document which we can work with
IC = tm_map(IC, PlainTextDocument)
#Get rid of numbers
IC = tm_map(IC, removeNumbers)
#Get rid of stopwords. Available stopword lists can be ascertained with the following command.
#You will get an error if stop words do not exist for that language. You could also build your own list.
#stopwords("english")
IC = tm_map(IC, removeWords, stopwords("english"))
#You might want to skip this last step if you are interested in seeing entire words.
#This is helpful if you want to generate a word cloud with full words, etc.
#IC = tm_map(IC, stemDocument, "english")
My corpus is now ready for some linguistic analyses. First I use the TermDocumentMatrix function to create a term-document matrix. This is a matrix of all the terms present in my corpus and their frequency.
tdm <- TermDocumentMatrix(IC, control = list(weighting = weightTf, tolower = FALSE))
This is not the most informative representation of the data, however, as some terms might be very frequent but only occur in one document (tweet). To account for this, you calculate the term frequency-inverse document frequency (tf-idf), which weighs how frequent a term is within a document against how many documents it occurs in. As Grun and Hornik put it, "this measure allows to omit terms which have low frequency as well as those occurring in many documents" (Grun and Hornik 2011: 12).
tfidf <- weightTfIdf(tdm, normalize = TRUE)
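For intuition, here is a small worked sketch of tf-idf in base R on a made-up term-document matrix. It assumes the common definition (term frequency divided by document length, multiplied by the base-2 log of the number of documents over the number of documents containing the term), which as far as I can tell is what `weightTfIdf` with `normalize = TRUE` computes; check the tm documentation if the exact values matter to you.

```r
# Toy term-document matrix: rows are terms, columns are documents.
m <- matrix(c(2, 0, 1,
              1, 1, 1,
              0, 3, 0),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("mission", "park", "tartine"),
                            c("doc1", "doc2", "doc3")))

# Term frequency, normalized by the length of each document (column).
tf <- sweep(m, 2, colSums(m), "/")

# Inverse document frequency: terms found in every document get 0.
idf <- log2(ncol(m) / rowSums(m > 0))

# Each row of tf is scaled by the idf of its term.
tfidf <- tf * idf
print(round(tfidf, 3))
```

Here "park" occurs in every document, so its weight is 0 everywhere, while "tartine" occurs in only one document and gets the highest weight there.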
You can use this matrix to explore your data. For example, I can look at terms which have occurred at least 50 times (result is alphabetical, not by relative frequency):
##  [1] "adoptvintagelove" "alley"
##  [3] "amazing" "art"
##  [5] "back" "bakery"
##  [7] "bar" "bayarea"
##  [9] "bear" "beautiful"
## [11] "best" "birthday"
## [13] "brcccius" "brows"
## [15] "brucius" "cafe"
## [17] "california" "chapel"
## [19] "cinema" "city"
## [21] "clarion" "coffee"
## [23] "colone" "come"
## [25] "day" "district"
## [27] "dog" "dolores"
## [29] "dolorespark" "elbo"
## [31] "engraving" "etchingea"
## [33] "food" "foreign"
## [35] "francisco" "friday"
## [37] "friends" "fun"
## [39] "get" "good"
## [41] "got" "great"
## [43] "happy" "heavyblackheart"
## [45] "jasonnevermind" "just"
## [47] "last" "lazy"
## [49] "life" "like"
## [51] "little" "love"
## [53] "manufactory" "mira"
## [55] "mission" "missionea"
## [57] "morning" "much"
## [59] "muttville" "muttvillesf"
## [61] "natural" "new"
## [63] "night" "now"
## [65] "one" "park"
## [67] "party" "photo"
## [69] "posted" "room"
## [71] "san" "sanfrancisco"
## [73] "science" "see"
## [75] "show" "smilingfacewithheartshapedeyes"
## [77] "sparkles" "street"
## [79] "streetart" "sunday"
## [81] "tartine" "tattoo"
## [83] "thanks" "theea"
## [85] "time" "today"
## [87] "tonight" "valencia"
## [89] "way" "weekend"
Or I can look at the 'frequency of occurrence of each word in the corpus' (code taken from Eight 2 late). For this we use a Document Term Matrix, where the rows are documents and the columns are terms. This is different from the Term Document Matrix, in which rows are terms and columns are documents. As we are adding up the columns to see what the most frequent terms are, we want a matrix in which the terms are columns, with each row in a column holding a count of how many times that term has occurred in a tweet.
#Get a document term matrix
dtm=t(tdm)
freq <- colSums(as.matrix(dtm))
#Length should be total number of terms
length(freq)
## [1] 22852
#create sort order (descending)
ord <- order(freq,decreasing=TRUE)
#inspect most frequently occurring terms
freq[head(ord)]
##      mission         park          san      dolores    francisco
##         1115          788          770          762          697
## sanfrancisco
##          682
#inspect least frequently occurring terms
freq[tail(ord)]
##      zzjgu  zzlnjvvfu zznulsykus   zztfkjyo  zzwpxnnpl     zzzojw
##          1          1          1          1          1          1
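The transpose-then-sum step can be checked on a toy matrix. This is only a sketch with made-up counts, using plain base R matrices rather than tm objects.

```r
# Toy term-document matrix: rows are terms, columns are tweets.
tdm_toy <- matrix(c(4, 0, 1,
                    1, 1, 1,
                    0, 2, 0),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(c("mission", "park", "tartine"),
                                  c("tweet1", "tweet2", "tweet3")))

# Transposing gives a document-term matrix: rows are tweets, columns terms.
dtm_toy <- t(tdm_toy)

# Column sums of the document-term matrix are the total counts per term.
freq_toy <- colSums(dtm_toy)

# Sort order (descending), then inspect the terms from most to least frequent.
ord_toy <- order(freq_toy, decreasing = TRUE)
print(freq_toy[ord_toy])
```

Summing the rows of the term-document matrix directly gives the same totals, so the transpose is a matter of convenience rather than necessity.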
My least frequently occurring terms are not that informative (a.k.a. they are gibberish!). If I want to edit my corpus so that I keep only those terms that occur in between X and Y documents (tweets here), I can weed out some of the gibberish, for example by including only terms that appear in at least 5 tweets. The following is also adapted from Eight 2 late:
#Check out those terms that occur in 5 to 300 tweets
dtmr <- DocumentTermMatrix(IC, control=list(bounds = list(global = c(5,300))))
#See what the most frequent and least frequent terms are under these parameters
freqr <- colSums(as.matrix(dtmr))
#length should be total number of terms
length(freqr)
## [1] 1957
#create sort order (descending)
ordr <- order(freqr,decreasing=TRUE)
#inspect most frequently occurring terms
freqr[head(ordr)]
##  photo    day    new posted street  alley
##    267    254    245    241    238    234
#inspect least frequently occurring terms
freqr[tail(ordr)]
##             website whiteheavycheckmark              within
##                   5                   5                   5
##                wont               wrong         yellowheart
##                   5                   5                   5
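To illustrate what the `bounds` setting is doing, here is a base R sketch on a made-up document-term matrix. It assumes the bounds are inclusive and refer to the number of documents a term appears in (not its total count), which is my reading of the tm documentation.

```r
# Toy document-term matrix: rows are tweets, columns are terms.
dtm_toy <- matrix(c(1, 0, 2,
                    1, 0, 0,
                    1, 1, 1,
                    1, 0, 0),
                  nrow = 4, byrow = TRUE,
                  dimnames = list(paste0("tweet", 1:4),
                                  c("mission", "zzjgu", "park")))

# Document frequency: in how many tweets does each term appear at all?
doc_freq <- colSums(dtm_toy > 0)

# Keep terms that appear in at least 2 and at most 3 tweets.
keep <- doc_freq >= 2 & doc_freq <= 3
dtm_small <- dtm_toy[, keep, drop = FALSE]
print(colnames(dtm_small))
```

The one-off term "zzjgu" is dropped for being too rare and "mission" for being in every tweet, which leaves only "park".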
#See terms that occur at least 80 times in corpus
findFreqTerms(dtmr,lowfreq=80)
##  [1] "adoptvintagelove" "alley"
##  [3] "amazing" "art"
##  [5] "back" "bar"
##  [7] "bayarea" "beautiful"
##  [9] "best" "birthday"
## [11] "brcccius" "brows"
## [13] "brucius" "chapel"
## [15] "city" "clarion"
## [17] "coffee" "colone"
## [19] "come" "day"
## [21] "dog" "dolorespark"
## [23] "elbo" "engraving"
## [25] "etchingea" "fun"
## [27] "get" "good"
## [29] "got" "great"
## [31] "happy" "heavyblackheart"
## [33] "jasonnevermind" "last"
## [35] "like" "little"
## [37] "love" "manufactory"
## [39] "morning" "muttvillesf"
## [41] "natural" "new"
## [43] "night" "now"
## [45] "one" "party"
## [47] "photo" "posted"
## [49] "room" "science"
## [51] "see" "show"
## [53] "smilingfacewithheartshapedeyes" "sparkles"
## [55] "street" "streetart"
## [57] "sunday" "tartine"
## [59] "tattoo" "thanks"
## [61] "theea" "time"
## [63] "today" "tonight"
## [65] "valencia" "weekend"
I can also make word clouds. Code from here.
findFreqTerms(dtm, 100)
##  [1] "alley" "art" "back"
##  [4] "bayarea" "brucius" "city"
##  [7] "clarion" "coffee" "come"
## [10] "day" "district" "dog"
## [13] "dolores" "dolorespark" "francisco"
## [16] "get" "good" "great"
## [19] "happy" "heavyblackheart" "just"
## [22] "last" "like" "love"
## [25] "manufactory" "mission" "morning"
## [28] "muttvillesf" "new" "night"
## [31] "now" "one" "park"
## [34] "photo" "posted" "room"
## [37] "san" "sanfrancisco" "sparkles"
## [40] "street" "streetart" "tartine"
## [43] "tattoo" "theea" "time"
## [46] "today" "tonight" "valencia"
freq = data.frame(sort(colSums(as.matrix(dtm)), decreasing=TRUE))
#brewer.pal() needs at least 3 colors; asking for fewer gives a warning and returns 3 anyway
wordcloud(rownames(freq), freq[,1], max.words=50, colors=brewer.pal(3, "Dark2"))
This is not very informative, however, about the entire corpus. I am much more interested in seeing the different TOPICS people are tweeting about. To look at this, I must perform topic modeling.
Because of the distribution (the Dirichlet distribution) used in our algorithm (Latent Dirichlet Allocation), we must select the number of topics we want to model beforehand. As I understand it, this is because the Dirichlet distribution describes the probabilities of a multinomial: a vector of fractions of some whole, which must sum to 1. The length of that vector (here, the number of topics) has to be fixed before the model can estimate how big each fraction is. As Savov (2014) puts it, "the addition of the Dirichlet distribution to the model allows us to specify our prior beliefs about what data X is likely to occur".
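A quick way to get a feel for this is to draw one sample from a Dirichlet distribution in base R. The sketch below uses the standard trick of normalizing independent gamma draws. The number of topics and the value of `alpha` are arbitrary choices for illustration, not settings taken from the model in this post.

```r
set.seed(1)

# Number of topics must be chosen first: it sets the length of the vector.
k_toy <- 10

# Concentration parameter, one value per topic (arbitrary for this sketch).
alpha <- rep(5, k_toy)

# A Dirichlet draw is a set of independent gamma draws divided by their sum.
g <- rgamma(k_toy, shape = alpha, rate = 1)
theta <- g / sum(g)

print(round(theta, 3))
print(sum(theta))
```

The result is one set of topic proportions for a single document: `k_toy` positive fractions that add up to 1. Without knowing `k_toy` there is no vector to draw.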
Anyway, there are two ways to determine the number of topics you want to use. The first one is to just pick a number and play around from there to see what produces topics that make sense. The second is more technical and involves harmonic means. I will not discuss that here as it takes some time, but if you are interested in trying out this method, check out Graham and Ackland 2015.
On to topic modeling! As mentioned, I'll be using Latent Dirichlet Allocation and Gibbs sampling. There are different ways to topic model using different algorithms, but this will be the one we'll try out today.
First we have to set the parameters for Gibbs sampling. A nice explanation of Gibbs sampling and what each of these parameters means / does is provided by Kailash Awati: "Gibbs Sampling works by performing a random walk in such a way that reflects the characteristics of a desired distribution. Because the starting point of the walk is chosen at random, it is necessary to discard the first few steps of the walk (as these do not correctly reflect the properties of distribution). This is referred to as the burn-in period. We set the burn-in parameter to 4000. Following the burn-in period, we perform 2000 iterations, taking every 500th iteration for further use. The reason we do this is to avoid correlations between samples. We use 5 different starting points (nstart = 5), that is, five independent runs. Each starting point requires a seed integer (this also ensures reproducibility) so I have provided five random integers in my seed list. Finally I have set best to TRUE (actually a default setting) which instructs the algorithm to return results of the run with the highest posterior probability… these settings do not guarantee the convergence of the algorithm to a globally optimal solution. Indeed, Gibbs sampling will, at best, find only a locally optimal solution, and even this is hard to prove mathematically in specific practical problems such as the one we are dealing with here. The upshot of this is that it is best to do lots of runs with different settings of parameters to check the stability of your results. The bottom line is that our interest is purely practical so it is good enough if the results make sense".
So let's set up the parameters for the topic model
#Set parameters for Gibbs sampling (parameters are those used in Grun and Hornik 2011)
#'burnin' refers to the burn-in period, discarding those first few steps of the 'walk'
burnin <- 4000
#'iter' refers to the number of iterations; 'thin' means we keep every 500th iteration of the 2000 we've specified for 'further use'
iter <- 2000
thin <- 500
#'seed' is a list of 5 random integers
seed <- list(2003,5,63,100001,765)
#'nstart' is how many independent runs / starting points
nstart <- 5
#'best' set to TRUE 'instructs the algorithm to return results of the run with the highest posterior probability'
best <- TRUE
I set the number of topics to 10 (I found this number to work best in terms of 'making sense' – but this can change depending on how much data you have, how varied the data is, etc.)
k<-10
#For the model we are using the document term matrix (dtm) NOT the term document matrix (tdm)
#Run the model
# ldaOut <-LDA(dtm,k, method="Gibbs",
#              control=list(nstart=nstart, seed = seed, best=best,
#                           burnin = burnin, iter = iter, thin=thin))
#Save the output
#save(ldaOut,file="ldaOut.RData")
Now we can start to explore the results. The topic model has gone through our corpus and assigned each term to a topic. Each term is also given a probability, which is the probability that it will occur in the assigned topic. We can look at the top terms in each topic to get an idea of the most probable words for each topic and come up with a qualitative description.
#Load the output back in
load("ldaOut.RData")
#Look at results
ldaOut.topics <- as.matrix(topics(ldaOut))
topTenTermsEachTopic <- terms(ldaOut,10)
##       Topic 1          Topic 2   Topic 3            Topic 4
##  [1,] "street"         "just"    "sanfrancisco"     "new"
##  [2,] "now"            "photo"   "bayarea"          "theea"
##  [3,] "back"           "posted"  "muttvillesf"      "thanks"
##  [4,] "valencia"       "best"    "dog"              "see"
##  [5,] "jasonnevermind" "got"     "tattoo"           "work"
##  [6,] "friends"        "first"   "brucius"          "old"
##  [7,] "great"          "cafe"    "natural"          "chocolate"
##  [8,] "people"         "kitchen" "science"          "store"
##  [9,] "like"           "ever"    "adoptvintagelove" "week"
## [10,] "always"         "cream"   "engraving"        "thrift"
##       Topic 5       Topic 6           Topic 7      Topic 8
##  [1,] "park"        "alley"           "mission"    "night"
##  [2,] "dolores"     "clarion"         "san"        "tonight"
##  [3,] "day"         "love"            "francisco"  "last"
##  [4,] "today"       "art"             "district"   "come"
##  [5,] "dolorespark" "streetart"       "sanea"      "get"
##  [6,] "sparkles"    "heavyblackheart" "drafthouse" "chapel"
##  [7,] "brows"       "mural"           "alamo"      "little"
##  [8,] "beautiful"   "amazing"         "fran"       "every"
##  [9,] "fun"         "life"            "heart"      "show"
## [10,] "city"        "area"            "days"       "going"
##       Topic 9              Topic 10
##  [1,] "one"                "tartine"
##  [2,] "happy"              "coffee"
##  [3,] "time"               "good"
##  [4,] "birthday"           "room"
##  [5,] "sunday"             "manufactory"
##  [6,] "friday"             "morning"
##  [7,] "mira"               "elbo"
##  [8,] "facewithtearsofjoy" "party"
##  [9,] "bear"               "colone"
## [10,] "lazy"               "smilingfacewithheartshapedeyes"
I can now create a CSV (comma separated value) file to look at in Excel and apply some labels to my topics.
#Look at the top 50 terms in each topic
ldaOut.terms <- as.matrix(terms(ldaOut,50))
#Save as CSV file to look at a bit closer
write.csv(ldaOut.terms,file=paste("LDAGibbs",k,"TopicstoTerms.csv"))
We can also get a look at the probabilities associated with each topic assignment. What we want is to eventually use these probabilities in a regression. Does an association with one of these topics predict a variable of interest?
topicProbabilities <- as.data.frame(ldaOut@gamma)
write.csv(topicProbabilities,file=paste("LDAGibbs",k,"TopicProbabilities.csv"))
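The shape of this table is easier to picture with a toy version. In the sketch below, the matrix stands in for the per-document topic probabilities (the numbers are made up): one row per post, one column per topic, each row summing to 1.

```r
# Made-up topic probabilities for two posts and three topics.
gamma_toy <- matrix(c(0.7, 0.2, 0.1,
                      0.1, 0.3, 0.6),
                    nrow = 2, byrow = TRUE)

# Each row is a probability distribution over topics.
print(rowSums(gamma_toy))

# The most likely topic per post is the column with the largest value.
most_likely <- apply(gamma_toy, 1, which.max)
print(most_likely)
```

The first post is assigned topic 1 and the second topic 3. Picking the largest probability in each row should line up with the single topic per document used earlier, while the full rows keep the information about how confident each assignment is.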
Having looked at the top 50 most probable terms for each topic, I've come up with some labels to describe them. What I want to do now is link back those topics to the tweets that have been assigned them. This involves joining two data frames together, and then replacing the numbers that describe my topics with my own invented prose versions.
#Write out the topics to a data frame so you can work with them
test <- as.data.frame(ldaOut.topics)
a <- c('Evaluation', 'Food','Service/Product Promos', 'Activities', 'Outdoors', 'Art', 'Places','Nightlife','Leisure','Hip Spots')
b <- c(1,2,3,4,5,6,7,8,9,10)
namesdf <- data.frame("Name"=a,"Number"=b)
test$V1 <- as.factor(test$V1)
newtopics <- FindReplace(data = test, Var = "V1", replaceData = namesdf, from = "Number", to = "Name", exact = TRUE)
## Only exact matches will be replaced.
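As an aside, the same relabeling can be sketched in base R with `factor()`, which maps each number to a label in one step. This is an alternative shown for illustration with made-up topic numbers and only three of the labels. It is not the method used above.

```r
# Made-up topic assignments for four posts.
topic_numbers <- c(3, 1, 2, 3)

# Labels in topic order: label i belongs to topic number i.
labels_toy <- c("Evaluation", "Food", "Service/Product Promos")

# levels lists the numbers, labels gives the name for each level.
topic_names <- factor(topic_numbers, levels = 1:3, labels = labels_toy)
print(as.character(topic_names))
```

Because each level is matched as a whole value, there is no risk of "1" also matching inside "10", which is the reason `exact = TRUE` matters in the find and replace above.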
#Merge topics with tweet corpus
data$ID <- 1:nrow(data)
newtopics$ID <- 1:nrow(newtopics)
topicdata <- merge(data,newtopics,by="ID")
#Merge topic probabilities with tweet corpus
topicProbabilities$ID <- 1:nrow(topicProbabilities)
newdata <- merge(topicdata, topicProbabilities,by="ID")
write.csv(newdata,file=paste("Tweetswtopics.csv"))
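A note on column names after this merge: both the topic labels and the topic probabilities have a column called `V1`, so `merge()` renames them with `.x` and `.y` suffixes. The toy example below, with made-up values, shows where a name like `V1.x` comes from.

```r
# Made-up topic labels, stored in a column named V1.
topics_toy <- data.frame(ID = 1:2, V1 = c("Food", "Art"))

# Made-up topic probabilities, whose first column is also named V1.
probs_toy <- data.frame(ID = 1:2, V1 = c(0.7, 0.1), V2 = c(0.3, 0.9))

# Columns that clash (other than the key) get .x and .y suffixes.
merged <- merge(topics_toy, probs_toy, by = "ID")
print(names(merged))
```

So in the merged data, `V1.x` holds the topic label and `V1.y` holds the probability of the first topic.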
You can now map your posts and see where assigned topics are happening!
newdata$longitude <- as.numeric(newdata$longitude)
newdata$latitude <- as.numeric(newdata$latitude)
lon <- newdata$longitude
lat <- newdata$latitude
newdata$V1.x <- factor(newdata$V1.x)
Topics <- newdata$V1.x
mapPoints <- ggmap(map) + geom_point(aes(x = lon, y = lat, color=Topics), data=newdata, alpha=0.5)
mapPoints
This can be kind of messy, so we can subset our data to just look at particular topics.
#Subset the data by all those posts NOT categorized as Promos, etc.
sub <- newdata[! newdata$V1.x %in% c("Activities", "Places","Service/Product Promos"),]
Look at the simplified map
sub$longitude <- as.numeric(sub$longitude)
sub$latitude <- as.numeric(sub$latitude)
lon <- sub$longitude
lat <- sub$latitude
sub$V1.x <- as.factor(sub$V1.x)
Topics <- sub$V1.x
mapPoints <- ggmap(map) + geom_point(aes(x = lon, y = lat, color=Topics), data=sub, alpha=0.5)
mapPoints
We can zoom into particular areas too to take a closer look at what is going on:
map2 <- get_map(location = 'Dolores Street and 19th Street, San Francisco, California', zoom = 17)
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=Dolores+Street+and+19th+Street,+San+Francisco,+California&zoom=17&size=640x640&scale=2&maptype=terrain&language=en-EN&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Dolores%20Street%20and%2019th%20Street,%20San%20Francisco,%20California&sensor=false
ggmap(map2)
mapPoints2 <- ggmap(map2) + geom_point(aes(x = lon, y = lat, color=Topics), data=sub, alpha=0.5)
mapPoints2
map3 <- get_map(location = 'Capp Street and 24th Street, San Francisco, California', zoom = 17)
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=Capp+Street+and+24th+Street,+San+Francisco,+California&zoom=17&size=640x640&scale=2&maptype=terrain&language=en-EN&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Capp%20Street%20and%2024th%20Street,%20San%20Francisco,%20California&sensor=false
ggmap(map3)
mapPoints3 <- ggmap(map3) + geom_point(aes(x = lon, y = lat, color=Topics), data=sub, alpha=0.5)
mapPoints3
We can also see how topics occur over time
#Create a data frame with number of tweets per time
d2 <- as.data.frame(table(newdata$created))
d2 <- d2[order(d2$Freq, decreasing=T), ]
names(d2) <- c("created","freq")
#Combine this with existing data frame
newdata2 <- merge(newdata,d2,by="created")
#Tell R that 'created' is not an integer or factor but a time.
newdata2$created <- as.POSIXct(newdata2$created, format="%m/%d/%Y %H:%M")
#60 minute intervals (binwidth is given in seconds)
minutes <- 60
Topics <- newdata2$V1.x
ggplot(newdata2, aes(created, color = Topics)) + geom_freqpoly(binwidth=60*minutes)
newdata3 <- newdata2[newdata2$created <= "0016-08-03 00:31:00", ]
minutes <- 60
Topics <- newdata3$V1.x
Freq <- newdata3$freq
ggplot(newdata3, aes(created, color = Topics)) + geom_freqpoly(binwidth=60*minutes)
I can also look at the frequency of these topics over time in a more abstract sense, by treating the posts as happening in one day to see overall patterns.
newdata$created2 <- as.POSIXct(newdata$created, format="%m/%d/%Y %H:%M")
newdata$created3 <- format(newdata$created2,'%H:%M:%S')
d3 <- as.data.frame(table(newdata$created3))
d3 <- d3[order(d3$Freq, decreasing=T), ]
names(d3) <- c("created3","freq3")
newdata <- merge(newdata,d3,by="created3")
newdata$created3 <- as.POSIXct(newdata$created3, format="%H:%M:%S")
minutes <- 60
Topics <- newdata$V1.x
overalltimes <- ggplot(newdata, aes(created3, color = Topics)) + geom_freqpoly(binwidth=60*minutes)
print(overalltimes)