November 3, 2016

Linguistic Landscapes

Linguistic Landscapes is a subfield of sociolinguistics that encompasses the study of "the visibility and salience of languages on public and commercial signs in a given territory or region" (Landry and Bourhis 1997).

Linguistic Landscapes on Social Media

  • Continuing efforts in LL research to recognize all aspects of the linguistic landscape, including language in online and social media spaces

Getting the data

With Twitter's search API you can search for tweets that have been 'geotagged': tweets that have longitude and latitude coordinates attached to them.

My research is based in the Mission District neighborhood in San Francisco, so I am using coordinates that will get me tweets from that area. I've chosen a radius of 1 km from this central coordinate, but you can expand this radius.
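Before the search will work, you need to load the twitteR package and authenticate with the Twitter API. Here is a minimal setup sketch, assuming you have already registered a Twitter app; the key and secret values are placeholders for your own credentials:

# Setup sketch: load twitteR and authenticate.
# Replace the placeholder strings with your own app's credentials.
library(twitteR)

setup_twitter_oauth(consumer_key    = "YOUR_CONSUMER_KEY",
                    consumer_secret = "YOUR_CONSUMER_SECRET",
                    access_token    = "YOUR_ACCESS_TOKEN",
                    access_secret   = "YOUR_ACCESS_SECRET")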

Getting the data

geo <- searchTwitter('', n = 100000, geocode = '37.76,-122.42,1km',
                     retryOnRateLimit = 1)
## Warning in doRppAPICall("search/tweets", n, params = params,
## retryOnRateLimit = retryOnRateLimit, : 100000 tweets were requested but the
## API can only return 2028

This could take some time depending on how fast your Internet connection is and how many tweets are available. I was optimistic with my request for 100,000 tweets, so I received a warning that tells me how many were actually available. You can play around with this number too if you are lucky enough to have more than 100,000 tweets available.
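A quick way to confirm how many tweets actually came back is to check the length of the returned list:

# How many tweets did the search actually return?
length(geo)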

Processing the data

Now you have a list of tweets. Lists are harder to manipulate than data frames in R, so you convert the list into a data frame:

geoDF <- twListToDF(geo)
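If you want a quick look at what this data frame contains (for instance, the text and statusSource columns used below), str() gives a compact overview:

# Peek at the structure: one row per tweet, with columns such as
# text and statusSource that are used in the following steps.
str(geoDF)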

Processing the data

Chances are there will be emojis in your Twitter data. You can 'transform' these emojis into prose using this code, along with a CSV file I've put together of what all of the emojis look like in R. (The idea for this comes from Jessica Peterka-Bonetta's work; she has a list of emojis as well, but it does not include the newest batch of emojis or the different skin-tone options for human emojis.) If you use this emoji list for your own research, please make sure to acknowledge both me and Jessica.

Processing the data

Load in the CSV file. You want to make sure it is located in the correct working directory so R can find it when you tell it to read it in.

emoticons <- read.csv("Decoded Emojis Col Sep.csv", header = T)
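It's worth glancing at the first few rows to confirm the file read in correctly; the find-and-replace step below expects it to have (at least) the columns R_Encoding and Name:

# Check the first few rows: the replacement step below uses the
# columns R_Encoding (how the emoji renders in R) and Name (its
# prose description).
head(emoticons)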

To transform the emojis, you first need to convert the tweet text into ASCII:

geoDF$text <- iconv(geoDF$text, from = "latin1", to = "ascii", 
                    sub = "byte")

Processing the data

To 'count' the emojis, you do a find-and-replace using the CSV file of 'Decoded Emojis' as a reference. Here I am using the DataCombine package. What this does is identify emojis in the tweeted Instagram posts and replace them with a prose version. I used whatever description pops up when hovering one's cursor over an emoji on an Apple emoji keyboard. Even if that description is not exactly the same as on other platforms, it provides enough information to find the emoji in question if you are not sure which one was used in the post.

library(DataCombine)

emojireplace <- FindReplace(data = geoDF, Var = "text",
                            replaceData = emoticons,
                            from = "R_Encoding", to = "Name",
                            exact = FALSE)

Processing the data

Now might be a good time to save this file. I save it in CSV format, with the date I collected the posts in the file name.

write.csv(emojireplace, file = paste("ALL", Sys.Date(), ".csv"))

Processing the data

Now you have a data frame which you can manipulate in various ways. For my research, I'm just interested in posts that occurred on Instagram. (Why not just access them via Instagram's API, you ask? Long story short: they are very, very conservative about providing access for academic research.) I've found a work-around, which is filtering mined tweets by those that have Instagram as a source:

data <- emojireplace[emojireplace$statusSource == 
        "<a href=\"http://instagram.com\" rel=\"nofollow\">Instagram</a>", ]

#Save this file
write.csv(data,file=paste("INSTA",Sys.Date(),".csv"))
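As a quick sanity check, you can see how many of the mined tweets were cross-posted from Instagram:

# How many of the collected tweets came from Instagram?
nrow(data)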

A note about the data

Important: Obviously, data collected this way are not representative of all Instagram posts made in the Mission District, since we depend on people who cross-post to Twitter, who are most likely a minority of Mission District Instagrammers. However, this points to an important fact about any data obtained via social media: they are never truly representative, partly because individuals must be assumed to be selective about what they post (posting is an inherently subjective process), and partly because not everyone is active on social media.

Visualizing the data

Now let's play around with visualizing the data. I want to superimpose different aspects of the tweets I collected on a map. First I have to get a map, which I do using the ggmap package, which interacts with the Google Maps API. When you use this package, be sure to cite it, as it requests when you first load it into your library.
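Loading the package looks like this; the citation() call prints the reference that ggmap asks you to use:

# Load ggmap and print the citation it asks you to include.
library(ggmap)
citation("ggmap")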

Visualizing the data

I request a map of the Mission District, and then check to make sure the map is what I want (in terms of zoom, area covered, etc.)

map <- get_map(location = 'Capp St. and 20th, San Francisco,
               California', zoom = 15)

ggmap(map)
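To superimpose the collected posts on this map, one possible sketch is to layer the points with geom_point, assuming the longitude and latitude columns that twListToDF produces (posts without coordinates become NA and are dropped):

# Possible overlay sketch: twListToDF stores coordinates as character
# strings, so convert them to numeric before plotting.
data$longitude <- as.numeric(data$longitude)
data$latitude  <- as.numeric(data$latitude)

ggmap(map) +
  geom_point(data = data, aes(x = longitude, y = latitude),
             colour = "red", alpha = 0.5, na.rm = TRUE)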