Public transit routing

For the transit demand model for the PolicySpace project (see previously) we needed to be able to ingest GTFS (General Transit Feed Specification) data and use it to generate public transit routes. That is, given a start and an end stop, what's the quickest route using only public transit? Ideally we minimize both travel time and number of transfers, or strike some balance between the two.
I found an existing implementation here, based off of the papers Trip-Based Public Transit Routing (Sascha Witt, 2015) and Trip-Based Public Transit Routing Using Condensed Search Trees (Sascha Witt, 2016). The routing worked fine, but processing the Belo Horizonte GTFS data (630 routes with 9,428 stops, including buses and metro) took over 12 hours. The routing itself did not seem to have a significant speed gain from this pre-processing (unfortunately I don't have specific numbers on hand).
Instead I spent the past few days implementing my own public transit routing algorithm. It's a first attempt, and so it's naive/not heavily optimized, but so far seems to work alright. The goals were to significantly reduce pre-processing time, have a relatively low memory footprint, and quickly produce routes.
First, it's worth explaining a bit about how GTFS data is structured and the terminology it uses.
Public transit has a few components:
- stops: same as the colloquial usage; a πstop is somewhere a bus, train, etc stops to pick up and drop off, described with an id and a latitude and longitude.
- trips: a πtrip is a sequence of stops associated with a set of arrival and departure times. For example, trip
πA
goes through stopπX
at 12:00, then stopπY
at 13:00, then stopπZ
at 14:00. TripπB
goes through stopπX
at 13:00, then stopπY
at 14:00, then stopπZ
at 15:00. Even though they have the same stop sequence (πX->πY->πZ
), they arrive/depart from each stop at different times, so they are distinct trips. - route: a route is a collection of trips. A route's trips do not necessarily mean they share the exact same sequence of stops. For example, a route may have a trip that goes
πX->πY->πZ
and another trip that only goesπX->πY
. (I didn't use this data and it's confusing because we're using the term "route" to describe something different. This is the last you'll see this particular usage in this post.) - service: a service associates routes with days of operation (e.g. weekdays, weekends, off or alternative schedules on holidays, etc).
With these terms defined, we can define a public transit "route" more specifically as "a sequence of trips and their transfer stops". This is a bit more specific than how we describe routes conversationally, which is more like "take the Q line and transfer to the M at Foo Station". Instead, what we're after is more like "take the 12:42 train on the Q line, transfer to the 13:16 M at Foo Station".
GTFS data includes a few files (which are essentially CSVs but for whatever reason use the txt
extension); the important ones here are:
stop_times.txt
: describes stop sequences for trips, as well as arrival and departure times for each trip's stops.stops.txt
: describes stop latitudes and longitudes
We use some of the other data files as well, but those are more for juggling stop and trip ids and less core to the routing algorithm.
Transfer network: structure
The general approach is to generate a transfer network which describes at what stops are various trips linked together (i.e. where transfers can be made). The nodes of this network are described as (trip_id, stop_id)
tuples. Note that not all stops of a trip will be nodes in the network; only those where transfers are made. Our resulting route only needs to return trips and where to transfer, not the individual stops within trips. That's something we can easily lookup later using the stop_times.txt
data if needed.
The edges of this network can describe two different things:
- inter-trip edges: If an edge connects two nodes which share a
trip_id
, e.g.(πA, πX)->(πA, πY)
, it indicates in the network structure that these nodes are along the same trip. Here the edge weight indicates travel time between stopsπX
andπY
for tripπA
. - transfer edges: If an edge connects two nodes which don't share a
trip_id
, e.g.(πA, πX)->(πB, πX)
it indicates a transfer between trips, e.g. transfer between tripsπA
andπB
at stopπX
, and the edge weight indicates estimated transfer time.
We also need to distinguish between direct and indirect transfers:
- a direct transfer is one where the two connected trips share a stop. For example,
(πA, πX)->(πB, πX)
is a direct transfer because these trips both share the stopπX
. - an indirect transfer is one where the two connected trips do not share a stop, but there is a reasonable footpath between the stops (i.e. you could walk between stops under some threshold transfer time limit, e.g. 5 minutes). For example
(πA, πX)->(πB, πY)
where stopπY
is around the corner from stopπX
. By introducing these additional edges, we potentially find shorter routes or routes where there normally wouldn't be one.
The transfer network is directed graph. This is due to the forward arrow of time. For example, trip πA
arrives at stop πX
at 12:10 and the πB
departs from stop πX
at 12:20, which indicates a valid transfer from trip πA
to πB
. However, I can't make the transfer in reverse; if I'm on trip πB
and arrive at πX
, trip πA
has potentially already departed from πX
. If the situation was that trip πA
lingers at πX
for long enough that the transfer does work in the reverse direction, we'd just have another edge (πB, πY)->(πA, πX)
.
Transfer network: generation
The pre-processing step is almost entirely the generation of this transfer network. Once we have this network it's relatively easy to generate routes with Dijkstra's algorithm.
The generation process is broken into three parts:
- Generate direct transfer edges
- Generate indirect transfer edges
- Generate inter-trip edges
Direct transfer edge generation
We can use the stop_times.txt
file to identify direct transfers. We group rows according to stop_id
and look at the trip_id
s under each group. However, we can't just create edges between all these trips here; we have to filter in two ways:
- down to valid transfers (those that obey the arrow of time). As noted before, a trip
πA
can only transfer to a tripπB
at stopπX
if tripπA
arrives atπX
before tripπB
departs. - for each equivalent trip, select only the soonest to the incoming trip.
Expanding on 2): we assume that travellers want to make the soonest transfer possible from among a set of equivalent trips. Equivalent trips are those that share the same stop sequence, regardless of specific arrival/departure times. For example: a trip πA
with stops πX->πY->πZ
and a trip πB
with stops πX->πY->πZ
share the same stop sequence, and so are equivalent.
Say we have a trip πC
that also stops at stop πX
, arriving at 12:10. Trip πA
departs from πX
at 12:20 and trip πB
departs from πX
at 12:40. We only create the edge (πC, πX)->(A, πX)
and not (πC, πX)->(πB, πX)
because we assume no person will wait 20 extra minutes to take πB
when πA
goes along the exact same stops and departs sooner.
This assumption greatly reduces the number of edges in the network. With the Belo Horizonte data, this assumption cut edges by about 80%, roughly 18.5 million fewer edges!
Indirect transfer edge generation
To generate indirect transfers, we use the following process:
- generate a spatial index (k-d tree) of all stops
- for each stop
πX
, find then
closest stops (neighbors)- for each neighbor
πY
:- compute estimated walking time between
πX
andπY
- create edges between trips going through
πX
andπY
, sticking to the same constraints laid out for direct transfers (valid and soonest equivalent transfers)
- compute estimated walking time between
- for each neighbor
The network edge count and processing time vary with n
. Increasing n
will, of course, increase edge count and processing time. I've been using n=3
but this will likely vary depending on the specifics of the public transit network you're looking at.
Generate inter-trip edges
Finally, we generate edges between nodes that share a trip. We link these nodes in their stop sequence order. For example, for a trip πA
with stop sequence πX->πY->πZ
, where πX
and πZ
are both transfer stops, we should have a edge (πA,πX)->(πA,πZ)
. (πA,πY)
isn't in the graph because it isn't a transfer node.
Trip routing
Once the transfer network is ready we can leverage Dijkstra's algorithm to generate the trip routes. However, we want to accept any arbitrary start and end stops, whereas the nodes in our network only represent transfer stops. Not all stops are included. We use the following procedure to accommodate for this:
- Given a starting stop
πs
and departure timeπdt
, find all tripsπT_s
that go throughπs
afterπdt
. - Given a ending stop
πe
, find all tripsπT_e
that go throughπe
afterπdt
. - If there are trips in both
πT_s
andπT_e
, that means there are trips from start to finish without any transfers. Return those and finish. Otherwise, continue. - For each potential starting trip
πt_s
inπT_s
, find the transfer stop soonest afterπs
inπt_s
(note that this could beπs
itself) to be an entry node into the transfer network. We'll denote these nodes asβN_s
. For example, if we have tripπA
with stop sequenceπX->πY->πZ
andπX
is our starting stop, but not a transfer stop (meaning it's not in our transfer network), andπY
is a transfer stop, we get the node(πA, πY)
. IfπX
is a transfer stop, then we just get(πA, πX)
. - For each potential ending trip
πt_e
inπT_e
, find the transfer stops closest toπe
inπt_e
(which could beπe
itself). We'll denote these nodes asβN_e
. This is just like step 4 except we go in reverse. For example, if we have tripπA
with the stop sequenceπX->πY->πZ
andπZ
is our starting stop but not a transfer stop, andπY
is a transfer stop, then we get the node(πA, πY)
. - Create a dummy start node,
βSTART
, and create edges from it to each node inβN_s
, where the edge weight is the travel time from stopπs
to that node. - Create a dummy end node,
βEND
, and create edges from each node inβN_e
to it, where the edge weight is the travel time from the node to stopπe
. - Use Dijkstra's algorithm to find the shortest weighted path from
βSTART
toβEND
. This path is our trip route.
Conclusion
On the Belo Horizonte data (630 routes with 9,428 stops) with n=3
for indirect transfer generation we get a graph of 239,667 nodes and 4,879,935 edges. I haven't yet figured out a way to further reduce this edge count.
The transfer network generation takes about 20 minutes on my ThinkPad X260. I'm not certain a direct comparison is appropriate, but this is significantly faster than the 12 hours the other implementation took. The actual route generation doesn't seem any faster or slower than the other implementation.
While the routing is fairly fast, it definitely could be faster, and is not fast enough for direct usage in PolicySpace. We won't be able to compute individual trip routes for every one of Brazil's ~209 million inhabitants, but we were never expecting to either. This trip routing feature will be used by some higher-level interface which will cache routes and batch compute more approximate routes between regions (called "transportation analysis zones" in the transit demand modeling literature) to cut down on processing time. We're still figuring out specifics.
Anyway, this is version 1 of this trip routing method — I'd appreciate any feedback and suggestions on how to improve the approach. I need to do more testing to get a better sense of the generated routes' quality.
The code for this implementation at time of writing is available here. There are a couple names different in the code than described here, e.g. the transfer network is called a "trip network", but the algorithm itself should be the same.