For the transit demand model for the PolicySpace project (see previously) we needed to be able to ingest GTFS (General Transit Feed Specification) data and use it to generate public transit routes. That is, given a start and an end stop, what's the quickest route using only public transit? Ideally we minimize both travel time and number of transfers, or strike some balance between the two.
I found an existing implementation here, based off of the papers Trip-Based Public Transit Routing (Sascha Witt, 2015) and Trip-Based Public Transit Routing Using Condensed Search Trees (Sascha Witt, 2016). The routing worked fine, but processing the Belo Horizonte GTFS data (630 routes with 9,428 stops, including buses and metro) took over 12 hours. The routing itself did not seem to have a significant speed gain from this pre-processing (unfortunately I don't have specific numbers on hand).
Instead I spent the past few days implementing my own public transit routing algorithm. It's a first attempt, and so it's naive/not heavily optimized, but so far seems to work alright. The goals were to significantly reduce pre-processing time, have a relatively low memory footprint, and quickly produce routes.
First, it's worth explaining a bit about how GTFS data is structured and the terminology it uses.
Public transit has a few components:
- stops: same as the colloquial usage; a 🚏stop is somewhere a bus, train, etc stops to pick up and drop off, described with an id and a latitude and longitude.
- trips: a 🚌trip is a sequence of stops associated with a set of arrival and departure times. For example, trip
🚌Agoes through stop
🚏Xat 12:00, then stop
🚏Yat 13:00, then stop
🚏Zat 14:00. Trip
🚌Bgoes through stop
🚏Xat 13:00, then stop
🚏Yat 14:00, then stop
🚏Zat 15:00. Even though they have the same stop sequence (
🚏X->🚏Y->🚏Z), they arrive/depart from each stop at different times, so they are distinct trips.
- route: a route is a collection of trips. A route's trips do not necessarily mean they share the exact same sequence of stops. For example, a route may have a trip that goes
🚏X->🚏Y->🚏Zand another trip that only goes
🚏X->🚏Y. (I didn't use this data and it's confusing because we're using the term "route" to describe something different. This is the last you'll see this particular usage in this post.)
- service: a service associates routes with days of operation (e.g. weekdays, weekends, off or alternative schedules on holidays, etc).
With these terms defined, we can define a public transit "route" more specifically as "a sequence of trips and their transfer stops". This is a bit more specific than how we describe routes conversationally, which is more like "take the Q line and transfer to the M at Foo Station". Instead, what we're after is more like "take the 12:42 train on the Q line, transfer to the 13:16 M at Foo Station".
GTFS data includes a few files (which are essentially CSVs but for whatever reason use the
txt extension); the important ones here are:
stop_times.txt: describes stop sequences for trips, as well as arrival and departure times for each trip's stops.
stops.txt: describes stop latitudes and longitudes
We use some of the other data files as well, but those are more for juggling stop and trip ids and less core to the routing algorithm.
Transfer network: structure
The general approach is to generate a transfer network which describes at what stops are various trips linked together (i.e. where transfers can be made). The nodes of this network are described as
(trip_id, stop_id) tuples. Note that not all stops of a trip will be nodes in the network; only those where transfers are made. Our resulting route only needs to return trips and where to transfer, not the individual stops within trips. That's something we can easily lookup later using the
stop_times.txt data if needed.
The edges of this network can describe two different things:
- inter-trip edges: If an edge connects two nodes which share a
(🚌A, 🚏X)->(🚌A, 🚏Y), it indicates in the network structure that these nodes are along the same trip. Here the edge weight indicates travel time between stops
- transfer edges: If an edge connects two nodes which don't share a
(🚌A, 🚏X)->(🚌B, 🚏X)it indicates a transfer between trips, e.g. transfer between trips
🚏X, and the edge weight indicates estimated transfer time.
We also need to distinguish between direct and indirect transfers:
- a direct transfer is one where the two connected trips share a stop. For example,
(🚌A, 🚏X)->(🚌B, 🚏X)is a direct transfer because these trips both share the stop
- an indirect transfer is one where the two connected trips do not share a stop, but there is a reasonable footpath between the stops (i.e. you could walk between stops under some threshold transfer time limit, e.g. 5 minutes). For example
(🚌A, 🚏X)->(🚌B, 🚏Y)where stop
🚏Yis around the corner from stop
🚏X. By introducing these additional edges, we potentially find shorter routes or routes where there normally wouldn't be one.
The transfer network is directed graph. This is due to the forward arrow of time. For example, trip
🚌A arrives at stop
🚏X at 12:10 and the
🚌B departs from stop
🚏X at 12:20, which indicates a valid transfer from trip
🚌B. However, I can't make the transfer in reverse; if I'm on trip
🚌B and arrive at
🚌A has potentially already departed from
🚏X. If the situation was that trip
🚌A lingers at
🚏X for long enough that the transfer does work in the reverse direction, we'd just have another edge
(🚌B, 🚏Y)->(🚌A, 🚏X).
Transfer network: generation
The pre-processing step is almost entirely the generation of this transfer network. Once we have this network it's relatively easy to generate routes with Dijkstra's algorithm.
The generation process is broken into three parts:
- Generate direct transfer edges
- Generate indirect transfer edges
- Generate inter-trip edges
Direct transfer edge generation
We can use the
stop_times.txt file to identify direct transfers. We group rows according to
stop_id and look at the
trip_ids under each group. However, we can't just create edges between all these trips here; we have to filter in two ways:
- down to valid transfers (those that obey the arrow of time). As noted before, a trip
🚌Acan only transfer to a trip
- for each equivalent trip, select only the soonest to the incoming trip.
Expanding on 2): we assume that travellers want to make the soonest transfer possible from among a set of equivalent trips. Equivalent trips are those that share the same stop sequence, regardless of specific arrival/departure times. For example: a trip
🚌A with stops
🚏X->🚏Y->🚏Z and a trip
🚌B with stops
🚏X->🚏Y->🚏Z share the same stop sequence, and so are equivalent.
Say we have a trip
🚌C that also stops at stop
🚏X, arriving at 12:10. Trip
🚌A departs from
🚏X at 12:20 and trip
🚌B departs from
🚏X at 12:40. We only create the edge
(🚌C, 🚏X)->(A, 🚏X) and not
(🚌C, 🚏X)->(🚌B, 🚏X) because we assume no person will wait 20 extra minutes to take
🚌A goes along the exact same stops and departs sooner.
This assumption greatly reduces the number of edges in the network. With the Belo Horizonte data, this assumption cut edges by about 80%, roughly 18.5 million fewer edges!
Indirect transfer edge generation
To generate indirect transfers, we use the following process:
- generate a spatial index (k-d tree) of all stops
- for each stop
🚏X, find the
nclosest stops (neighbors)
- for each neighbor
- compute estimated walking time between
- create edges between trips going through
🚏Y, sticking to the same constraints laid out for direct transfers (valid and soonest equivalent transfers)
- compute estimated walking time between
- for each neighbor
The network edge count and processing time vary with
n will, of course, increase edge count and processing time. I've been using
n=3 but this will likely vary depending on the specifics of the public transit network you're looking at.
Generate inter-trip edges
Finally, we generate edges between nodes that share a trip. We link these nodes in their stop sequence order. For example, for a trip
🚌A with stop sequence
🚏Z are both transfer stops, we should have a edge
(🚌A,🚏Y) isn't in the graph because it isn't a transfer node.
Once the transfer network is ready we can leverage Dijkstra's algorithm to generate the trip routes. However, we want to accept any arbitrary start and end stops, whereas the nodes in our network only represent transfer stops. Not all stops are included. We use the following procedure to accommodate for this:
- Given a starting stop
🚏sand departure time
🕒dt, find all trips
🚌T_sthat go through
- Given a ending stop
🚏e, find all trips
🚌T_ethat go through
- If there are trips in both
🚌T_e, that means there are trips from start to finish without any transfers. Return those and finish. Otherwise, continue.
- For each potential starting trip
🚌T_s, find the transfer stop soonest after
🚌t_s(note that this could be
🚏sitself) to be an entry node into the transfer network. We'll denote these nodes as
⭕N_s. For example, if we have trip
🚌Awith stop sequence
🚏Xis our starting stop, but not a transfer stop (meaning it's not in our transfer network), and
🚏Yis a transfer stop, we get the node
(🚌A, 🚏Y). If
🚏Xis a transfer stop, then we just get
- For each potential ending trip
🚌T_e, find the transfer stops closest to
🚌t_e(which could be
🚏eitself). We'll denote these nodes as
⭕N_e. This is just like step 4 except we go in reverse. For example, if we have trip
🚌Awith the stop sequence
🚏Zis our starting stop but not a transfer stop, and
🚏Yis a transfer stop, then we get the node
- Create a dummy start node,
⭕START, and create edges from it to each node in
⭕N_s, where the edge weight is the travel time from stop
🚏sto that node.
- Create a dummy end node,
⭕END, and create edges from each node in
⭕N_eto it, where the edge weight is the travel time from the node to stop
- Use Dijkstra's algorithm to find the shortest weighted path from
⭕END. This path is our trip route.
On the Belo Horizonte data (630 routes with 9,428 stops) with
n=3 for indirect transfer generation we get a graph of 239,667 nodes and 4,879,935 edges. I haven't yet figured out a way to further reduce this edge count.
The transfer network generation takes about 20 minutes on my ThinkPad X260. I'm not certain a direct comparison is appropriate, but this is significantly faster than the 12 hours the other implementation took. The actual route generation doesn't seem any faster or slower than the other implementation.
While the routing is fairly fast, it definitely could be faster, and is not fast enough for direct usage in PolicySpace. We won't be able to compute individual trip routes for every one of Brazil's ~209 million inhabitants, but we were never expecting to either. This trip routing feature will be used by some higher-level interface which will cache routes and batch compute more approximate routes between regions (called "transportation analysis zones" in the transit demand modeling literature) to cut down on processing time. We're still figuring out specifics.
Anyway, this is version 1 of this trip routing method — I'd appreciate any feedback and suggestions on how to improve the approach. I need to do more testing to get a better sense of the generated routes' quality.
The code for this implementation at time of writing is available here. There are a couple names different in the code than described here, e.g. the transfer network is called a "trip network", but the algorithm itself should be the same.