TL;DR - Exploring New York City via GitHub Actions, OpenStreetMap, and Strava.
There are few things I find consistently more enjoyable than exploring my surroundings from the perspective of a bicycle. Whether it’s big cities or narrow country roads, there are details which can only be experienced from this perspective. Whenever I’m biking around and come across something I know most people don’t get to experience, I’m often reminded of the following:
The world reveals itself to those who travel on foot. — Werner Herzog
Traversing the world by bike, silently under your own power, unencumbered by traffic, parking, or roads in general, is a freedom second to none. Just yourself and your willingness to see what’s around the next corner. I’ve been living in Manhattan for roughly three years now and over that time have encountered some pretty incredible sights.
Anyway, I’ve been a fan of GitHub Actions ever since they rolled out the service and have been looking for use cases other than the typical CI/CD scenario. I’ve had a weekly Goodreads cron job running for a while, which scrapes my reading history and populates a table, but that’s not actually very interesting. It finally dawned on me that I should build out some kind of service to track my bike exploration progress.
I started out with a simple Python script which scrapes the latitude/longitude pairs from each new Strava activity and persists them as downsampled SVGs via the Visvalingam–Whyatt algorithm. I then created a new "nyc biking page", set up a cron job running daily at midnight, and quickly got hooked on seeing my progress.
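For anyone curious what the downsampling step involves, here is a minimal sketch of Visvalingam–Whyatt simplification. It is not the script described above: the function names, the `min_area` threshold, and the toy track are all made up for illustration, and it treats longitude/latitude as flat coordinates, which is only a rough approximation.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (longitude, latitude), treated as planar here


def triangle_area(a: Point, b: Point, c: Point) -> float:
    """Area of the triangle formed by three points."""
    return abs((b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1])) / 2.0


def simplify_vw(points: List[Point], min_area: float) -> List[Point]:
    """Repeatedly drop the interior point whose triangle with its two
    neighbours has the smallest area, until every remaining triangle has
    an area of at least min_area. Endpoints are always kept.
    This naive version is O(n^2); a heap makes it O(n log n)."""
    pts = list(points)
    while len(pts) > 2:
        areas = [
            triangle_area(pts[i - 1], pts[i], pts[i + 1])
            for i in range(1, len(pts) - 1)
        ]
        smallest = min(areas)
        if smallest >= min_area:
            break
        del pts[areas.index(smallest) + 1]  # +1 because areas skips the first point
    return pts


def svg_path_data(points: List[Point]) -> str:
    """Build an SVG path 'd' string. Latitude is negated because the
    SVG y axis points down."""
    return "M " + " L ".join(f"{lon:.5f},{-lat:.5f}" for lon, lat in points)


# Hypothetical track: the second point is a tiny wiggle that should be dropped.
track = [(0.0, 0.0), (1.0, 0.01), (2.0, 0.0), (3.0, 1.0), (4.0, 0.0)]
simplified = simplify_vw(track, min_area=0.1)
print(svg_path_data(simplified))
```

The appeal of this algorithm for drawing routes is that it removes the points that change the visible shape the least, so the SVG stays small without losing the overall outline of the ride.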
The next step was actually quantifying my exploration progress. I started out by downloading the OpenStreetMap (OSM) road network graph for New York City (minus Staten Island) via OSMnx. As a first pass, I thought I might be able to get away with simply checking off the roads which came within some radius (say 10 m) of the GPS path scraped from Strava, but this yielded predictably terrible results. It turns out the process of matching GPS points to a standardized graph is called map matching, and there exists a whole body of research around it.
After a quick literature review, I decided on implementing the paper “Fast Hidden Markov Model Map-Matching for Sparse and Noisy Trajectories”, which is an improvement on the earlier paper “Hidden Markov map matching through noise and sparseness”. The implementation was fairly straightforward, with the hidden Markov model (HMM) represented as a trellis-style graph using NetworkX. I was then able to simply call the bidirectional Dijkstra method, which produces the shortest path through the HMM trellis. It works pretty well, and I was able to leverage my Taxicab routing module, which was satisfying. Performance is decent too: because I indexed the New York City OSM network graph as a ball tree, I can process ~500 GPS paths (several thousand points each) in roughly 2 minutes.
Below is a fairly typical example of the map matching process across a small portion of a much larger path. The GPS path (slightly downsampled) is pretty bad due to the urban canyon nature of midtown Manhattan, but the algorithm was able to resolve the correct path pretty well.
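To make the trellis idea concrete, here is a toy sketch, assuming flat x/y coordinates in metres and two parallel roads. It is not the implementation described above. The constants `SIGMA` and `BETA`, the `route_dist` helper, and the candidate layout are all hypothetical; in a real pipeline the candidates would come from a spatial index and the route distances from a router. The costs are negative log-probabilities in the general style of HMM map matching (a Gaussian-shaped emission term and an exponential-shaped transition term), so the shortest path through the trellis is the most likely sequence of roads.

```python
import math

import networkx as nx

SIGMA = 5.0  # assumed GPS noise scale, metres
BETA = 3.0   # assumed tolerance for straight-line vs. route distance, metres


def emission_cost(gps_to_road_m):
    """Negative log of a Gaussian-shaped emission probability (unnormalised)."""
    return 0.5 * (gps_to_road_m / SIGMA) ** 2


def transition_cost(straight_m, route_m):
    """Negative log of an exponential-shaped transition probability (unnormalised)."""
    return abs(straight_m - route_m) / BETA


def match(observations, candidates, route_dist):
    """observations: list of (x, y) GPS fixes.
    candidates[t]: list of (road_id, (x, y) projection onto that road).
    route_dist(c1, c2): on-network distance between two candidates."""
    trellis = nx.DiGraph()
    for t, (obs, cands) in enumerate(zip(observations, candidates)):
        for k, cand in enumerate(cands):
            e_cost = emission_cost(math.dist(obs, cand[1]))
            if t == 0:
                trellis.add_edge("start", (t, k), weight=e_cost)
                continue
            straight = math.dist(observations[t - 1], obs)
            for j, prev in enumerate(candidates[t - 1]):
                cost = transition_cost(straight, route_dist(prev, cand)) + e_cost
                trellis.add_edge((t - 1, j), (t, k), weight=cost)
    last = len(observations) - 1
    for k in range(len(candidates[last])):
        trellis.add_edge((last, k), "end", weight=0.0)
    # All weights are non-negative, so Dijkstra applies.
    _, path = nx.bidirectional_dijkstra(trellis, "start", "end", weight="weight")
    return [candidates[t][k][0] for t, k in path[1:-1]]


def route_dist(c1, c2):
    """Hypothetical router: switching roads costs a 200 m detour."""
    detour = 0.0 if c1[0] == c2[0] else 200.0
    return math.dist(c1[1], c2[1]) + detour


# Road A runs along y = 0, road B along y = 10. The second fix is noisy
# and sits closer to B even though the rider stayed on A.
observations = [(0.0, 1.0), (10.0, 6.0), (20.0, -1.0), (30.0, 2.0)]
candidates = [[("A", (x, 0.0)), ("B", (x, 10.0))] for x, _ in observations]

nearest = [min(c, key=lambda rc: math.dist(o, rc[1]))[0]
           for o, c in zip(observations, candidates)]
matched = match(observations, candidates, route_dist)
print("nearest road per fix:", nearest)
print("HMM map matching:    ", matched)
```

The nearest-road approach jumps to road B for the noisy fix, while the trellis keeps the whole ride on road A because hopping between roads would imply an implausibly long route for such a short straight-line move.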
With the map matching process dialed in, I can now track which roads I’ve covered, the date when I did so, and the total number of times I’ve been on that particular road segment. All of these details are persisted as edge attributes, allowing for a bunch of interesting analysis possibilities. The only remaining issue was dealing with storage. The NYC OSM graph in XML format takes up ~79 MB, which is a little too big for a GitHub repo. Luckily, a gzipped pickle version only takes up ~16 MB!
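Writing a gzipped pickle only needs the standard library. The sketch below uses a plain dict as a stand-in for the graph so it runs without any dependencies; the attribute names (`first_ridden`, `ride_count`) and the file name are hypothetical. The same two helpers should work with any picklable object, such as a NetworkX graph.

```python
import gzip
import pickle
import tempfile
from pathlib import Path


def save_graph(graph, path):
    """Pickle the object and gzip-compress it in one pass."""
    with gzip.open(path, "wb") as fh:
        pickle.dump(graph, fh, protocol=pickle.HIGHEST_PROTOCOL)


def load_graph(path):
    """Inverse of save_graph. Only unpickle files you created yourself:
    loading a pickle can execute arbitrary code."""
    with gzip.open(path, "rb") as fh:
        return pickle.load(fh)


# Stand-in for the road graph: edge (u, v) -> attributes (hypothetical names).
graph = {
    (1, 2): {"length_m": 80.5, "first_ridden": "2021-05-01", "ride_count": 3},
    (2, 3): {"length_m": 120.0, "first_ridden": None, "ride_count": 0},
}

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "graph.pkl.gz"
    save_graph(graph, target)
    restored = load_graph(target)

print(restored == graph)
```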
The last step was deciding on what progress metrics I wanted to display. I’ve been experimenting with a few options, but have found the following to be the most useful/motivating.
Here, “Explored” represents the number of road miles covered divided by the total number of road miles in that borough, and “Efficiency” is the number of new road miles covered divided by the total number of road miles for that individual activity. The 40 most recent activities are displayed with a 10-activity moving average. These graphs are still very much a work in progress, though, and the art of gamifying them has been fun. So far I’ve learned that metrics involving time (progress/week, etc.) forced me into biking around when I wasn’t in the mood, simply because I didn’t want a low score, which is basically the exact opposite of what I’m going for. Anyway, I’m thinking the next addition will include some kind of heat map which I can then reference while I’m out riding: something to highlight the small unexplored pockets in between the major destinations.
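The two metrics are simple ratios, so they can be sketched in a few lines. The function names and the sample rides below are invented for illustration, and the trailing window (averaging each activity with the ones before it) is an assumption about how the moving average is computed.

```python
def explored_fraction(covered_miles, total_miles):
    """Road miles covered in a borough divided by its total road miles."""
    return covered_miles / total_miles if total_miles else 0.0


def efficiency(new_miles, activity_miles):
    """New road miles in an activity divided by that activity's total road miles."""
    return new_miles / activity_miles if activity_miles else 0.0


def moving_average(values, window=10):
    """Trailing moving average; early entries average over fewer values."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out


# Hypothetical activities as (new road miles, total road miles), oldest first.
rides = [(6.0, 12.0), (2.0, 8.0), (0.0, 10.0), (5.0, 10.0)]

recent = [efficiency(new, total) for new, total in rides][-40:]  # 40 most recent
smoothed = moving_average(recent, window=10)
print(recent)
print(smoothed)
```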
Thanks for reading!