Pop/Soda/Coke According to Twitter

I have always been impressed with the Pop/Soda map created by Alan McConchie (here), so I decided to create my own. Instead of asking for user submissions like McConchie, I let Twitter decide. To do this, I wrote a Python script that used the Twitter API to record the location of every geotagged tweet it received that included a reference to “pop”, “soda” or “coke”. Because pop/soda/coke can have multiple meanings and usages, I built a simple machine learning classifier to decide whether each tweet should be included. Below are the results.
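As a rough illustration of the collection step described above, here is a minimal sketch of the per-tweet check. It is not the actual script: the function name and regular expression are hypothetical, and it assumes each tweet arrives as a decoded JSON dictionary in which "coordinates" is either empty or a GeoJSON Point, as in the v1.1 tweet format.

```python
import re

# Hypothetical pattern: match the three terms as whole words only.
TERMS = re.compile(r"\b(pop|soda|coke)\b", re.IGNORECASE)


def extract_term_and_location(tweet):
    """Return (term, longitude, latitude) for a geotagged tweet that
    mentions one of the terms, or None if the tweet should be skipped."""
    coords = tweet.get("coordinates")  # GeoJSON Point, or None if not geotagged
    if not coords or coords.get("type") != "Point":
        return None
    match = TERMS.search(tweet.get("text", ""))
    if match is None:
        return None
    lon, lat = coords["coordinates"]  # GeoJSON order is longitude, then latitude
    return match.group(1).lower(), lon, lat
```

A whole-word match skips "popcorn" but still lets through "Pop-Up" and "King of Pop", which is why a classifier is needed on top of it.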

Because the Twitter API exposes only roughly 1% of the full Twitter stream, and geotagged tweets make up only roughly 1% of all tweets, it took a while to collect enough data. From July 2015 to the present (September 2016), the Twitter API provided only 67,495 tweets containing the words “pop”, “soda”, or “coke”. Because these words can be used in many ways, I had to build a filter that let through only the references I wanted. To do this, I used some machine learning courtesy of scikit-learn and Python. I first built a random forest classifier, which got me most of the way there. To classify the remaining tweets, I vectorized the tweet text and used a Bayesian classifier. This let me ignore tweets such as “King of Pop”, “pop music”, “rum and coke”, or “come to the Nike Pop-Up”, along with other irrelevant uses. After this filtering, the pool of usable tweets dropped substantially, to under 10,000.
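To give a feel for the "vectorize the text, then apply a Bayesian classifier" step, here is a small sketch. It is a stand-in, not the code behind the map: it uses only the Python standard library instead of scikit-learn, the class and function names are hypothetical, and the eight training tweets are invented for illustration. It counts words per label (a bag-of-words vectorization) and scores new text with a multinomial naive Bayes rule using add-one smoothing.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class TinyNaiveBayes:
    """Multinomial naive Bayes over word counts, with add-one smoothing."""

    def fit(self, texts, labels):
        self.doc_counts = Counter(labels)  # number of tweets per label
        self.word_counts = {}              # label -> Counter of word frequencies
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts.setdefault(label, Counter()).update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.word_counts.items():
            # log prior + sum of smoothed log likelihoods
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(counts.values()) + len(self.vocab)
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented training examples, for illustration only.
train_texts = [
    "grabbing a soda with lunch",
    "can i get a pop with my fries",
    "ordered a coke and a burger",
    "drinking a cold pop on the porch",
    "the king of pop forever",
    "listening to pop music all day",
    "rum and coke tonight",
    "come to the nike pop-up shop",
]
train_labels = ["drink"] * 4 + ["other"] * 4

model = TinyNaiveBayes().fit(train_texts, train_labels)
```

With this toy data, model.predict("pop music on repeat") comes back as "other" and model.predict("a cold soda with fries") as "drink". A real filter would need far more labeled tweets than this.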

Overall, I’m fairly disappointed with the initial results. There do seem to be a few areas that match the trends found in McConchie’s map, which is good to see. But ultimately I need to keep harvesting tweets so that I have more data to work with. Because of this, I’ll continue to run my Python script, and will likely update this map in a year or so…