Public transit routing

For the transit demand model for the PolicySpace project (see previously) we needed to be able to ingest GTFS (General Transit Feed Specification) data and use it to generate public transit routes. That is, given a start and an end stop, what's the quickest route using only public transit? Ideally we minimize both travel time and number of transfers, or strike some balance between the two.

I found an existing implementation here, based off of the papers Trip-Based Public Transit Routing (Sascha Witt, 2015) and Trip-Based Public Transit Routing Using Condensed Search Trees (Sascha Witt, 2016). The routing worked fine, but processing the Belo Horizonte GTFS data (630 routes with 9,428 stops, including buses and metro) took over 12 hours. The routing itself did not seem to have a significant speed gain from this pre-processing (unfortunately I don't have specific numbers on hand).

Instead I spent the past few days implementing my own public transit routing algorithm. It's a first attempt, and so it's naive/not heavily optimized, but so far seems to work alright. The goals were to significantly reduce pre-processing time, have a relatively low memory footprint, and quickly produce routes.

First, it's worth explaining a bit about how GTFS data is structured and the terminology it uses.

Public transit has a few components:

  • stops: same as the colloquial usage; a 🚏stop is somewhere a bus, train, etc stops to pick up and drop off, described with an id and a latitude and longitude.
  • trips: a 🚌trip is a sequence of stops associated with a set of arrival and departure times. For example, trip 🚌A goes through stop 🚏X at 12:00, then stop 🚏Y at 13:00, then stop 🚏Z at 14:00. Trip 🚌B goes through stop 🚏X at 13:00, then stop 🚏Y at 14:00, then stop 🚏Z at 15:00. Even though they have the same stop sequence (🚏X->🚏Y->🚏Z), they arrive/depart from each stop at different times, so they are distinct trips.
  • route: a route is a collection of trips. A route's trips do not necessarily mean they share the exact same sequence of stops. For example, a route may have a trip that goes 🚏X->🚏Y->🚏Z and another trip that only goes 🚏X->🚏Y. (I didn't use this data and it's confusing because we're using the term "route" to describe something different. This is the last you'll see this particular usage in this post.)
  • service: a service associates routes with days of operation (e.g. weekdays, weekends, off or alternative schedules on holidays, etc).

With these terms defined, we can define a public transit "route" more specifically as "a sequence of trips and their transfer stops". This is a bit more specific than how we describe routes conversationally, which is more like "take the Q line and transfer to the M at Foo Station". Instead, what we're after is more like "take the 12:42 train on the Q line, transfer to the 13:16 M at Foo Station".

GTFS data includes a few files (which are essentially CSVs but for whatever reason use the txt extension); the important ones here are:

  • stop_times.txt: describes stop sequences for trips, as well as arrival and departure times for each trip's stops.
  • stops.txt: describes stop latitudes and longitudes

We use some of the other data files as well, but those are more for juggling stop and trip ids and less core to the routing algorithm.

Transfer network: structure

The general approach is to generate a transfer network which describes at what stops are various trips linked together (i.e. where transfers can be made). The nodes of this network are described as (trip_id, stop_id) tuples. Note that not all stops of a trip will be nodes in the network; only those where transfers are made. Our resulting route only needs to return trips and where to transfer, not the individual stops within trips. That's something we can easily lookup later using the stop_times.txt data if needed.

The edges of this network can describe two different things:

  • inter-trip edges: If an edge connects two nodes which share a trip_id, e.g. (🚌A, 🚏X)->(🚌A, 🚏Y), it indicates in the network structure that these nodes are along the same trip. Here the edge weight indicates travel time between stops 🚏X and 🚏Y for trip 🚌A.
  • transfer edges: If an edge connects two nodes which don't share a trip_id, e.g. (🚌A, 🚏X)->(🚌B, 🚏X) it indicates a transfer between trips, e.g. transfer between trips 🚌A and 🚌B at stop 🚏X, and the edge weight indicates estimated transfer time.

We also need to distinguish between direct and indirect transfers:

  • a direct transfer is one where the two connected trips share a stop. For example, (🚌A, 🚏X)->(🚌B, 🚏X) is a direct transfer because these trips both share the stop 🚏X.
  • an indirect transfer is one where the two connected trips do not share a stop, but there is a reasonable footpath between the stops (i.e. you could walk between stops under some threshold transfer time limit, e.g. 5 minutes). For example (🚌A, 🚏X)->(🚌B, 🚏Y) where stop 🚏Y is around the corner from stop 🚏X. By introducing these additional edges, we potentially find shorter routes or routes where there normally wouldn't be one.

The transfer network is directed graph. This is due to the forward arrow of time. For example, trip 🚌A arrives at stop 🚏X at 12:10 and the 🚌B departs from stop 🚏X at 12:20, which indicates a valid transfer from trip 🚌A to 🚌B. However, I can't make the transfer in reverse; if I'm on trip 🚌B and arrive at 🚏X, trip 🚌A has potentially already departed from 🚏X. If the situation was that trip 🚌A lingers at 🚏X for long enough that the transfer does work in the reverse direction, we'd just have another edge (🚌B, 🚏Y)->(🚌A, 🚏X).

Transfer network: generation

The pre-processing step is almost entirely the generation of this transfer network. Once we have this network it's relatively easy to generate routes with Dijkstra's algorithm.

The generation process is broken into three parts:

  1. Generate direct transfer edges
  2. Generate indirect transfer edges
  3. Generate inter-trip edges

Direct transfer edge generation

We can use the stop_times.txt file to identify direct transfers. We group rows according to stop_id and look at the trip_ids under each group. However, we can't just create edges between all these trips here; we have to filter in two ways:

  1. down to valid transfers (those that obey the arrow of time). As noted before, a trip 🚌A can only transfer to a trip 🚌B at stop 🚏X if trip 🚌A arrives at 🚏X before trip 🚌B departs.
  2. for each equivalent trip, select only the soonest to the incoming trip.

Expanding on 2): we assume that travellers want to make the soonest transfer possible from among a set of equivalent trips. Equivalent trips are those that share the same stop sequence, regardless of specific arrival/departure times. For example: a trip 🚌A with stops 🚏X->🚏Y->🚏Z and a trip 🚌B with stops 🚏X->🚏Y->🚏Z share the same stop sequence, and so are equivalent.

Say we have a trip 🚌C that also stops at stop 🚏X, arriving at 12:10. Trip 🚌A departs from 🚏X at 12:20 and trip 🚌B departs from 🚏X at 12:40. We only create the edge (🚌C, 🚏X)->(A, 🚏X) and not (🚌C, 🚏X)->(🚌B, 🚏X) because we assume no person will wait 20 extra minutes to take 🚌B when 🚌A goes along the exact same stops and departs sooner.

This assumption greatly reduces the number of edges in the network. With the Belo Horizonte data, this assumption cut edges by about 80%, roughly 18.5 million fewer edges!

Indirect transfer edge generation

To generate indirect transfers, we use the following process:

  1. generate a spatial index (k-d tree) of all stops
  2. for each stop 🚏X, find the n closest stops (neighbors)
    1. for each neighbor 🚏Y:
      1. compute estimated walking time between 🚏X and 🚏Y
      2. create edges between trips going through 🚏X and 🚏Y, sticking to the same constraints laid out for direct transfers (valid and soonest equivalent transfers)

The network edge count and processing time vary with n. Increasing n will, of course, increase edge count and processing time. I've been using n=3 but this will likely vary depending on the specifics of the public transit network you're looking at.

Generate inter-trip edges

Finally, we generate edges between nodes that share a trip. We link these nodes in their stop sequence order. For example, for a trip 🚌A with stop sequence 🚏X->🚏Y->🚏Z, where 🚏X and 🚏Z are both transfer stops, we should have a edge (🚌A,🚏X)->(🚌A,🚏Z). (🚌A,🚏Y) isn't in the graph because it isn't a transfer node.

Trip routing

Once the transfer network is ready we can leverage Dijkstra's algorithm to generate the trip routes. However, we want to accept any arbitrary start and end stops, whereas the nodes in our network only represent transfer stops. Not all stops are included. We use the following procedure to accommodate for this:

  1. Given a starting stop 🚏s and departure time πŸ•’dt, find all trips 🚌T_s that go through 🚏s after πŸ•’dt.
  2. Given a ending stop 🚏e, find all trips 🚌T_e that go through 🚏e after πŸ•’dt.
  3. If there are trips in both 🚌T_s and 🚌T_e, that means there are trips from start to finish without any transfers. Return those and finish. Otherwise, continue.
  4. For each potential starting trip 🚌t_s in 🚌T_s, find the transfer stop soonest after 🚏s in 🚌t_s (note that this could be 🚏s itself) to be an entry node into the transfer network. We'll denote these nodes as β­•N_s. For example, if we have trip 🚌A with stop sequence 🚏X->🚏Y->🚏Z and 🚏X is our starting stop, but not a transfer stop (meaning it's not in our transfer network), and 🚏Y is a transfer stop, we get the node (🚌A, 🚏Y). If 🚏X is a transfer stop, then we just get (🚌A, 🚏X).
  5. For each potential ending trip 🚌t_e in 🚌T_e, find the transfer stops closest to 🚏e in 🚌t_e (which could be 🚏e itself). We'll denote these nodes as β­•N_e. This is just like step 4 except we go in reverse. For example, if we have trip 🚌A with the stop sequence 🚏X->🚏Y->🚏Z and 🚏Z is our starting stop but not a transfer stop, and 🚏Y is a transfer stop, then we get the node (🚌A, 🚏Y).
  6. Create a dummy start node, β­•START, and create edges from it to each node in β­•N_s, where the edge weight is the travel time from stop 🚏s to that node.
  7. Create a dummy end node, β­•END, and create edges from each node in β­•N_e to it, where the edge weight is the travel time from the node to stop 🚏e.
  8. Use Dijkstra's algorithm to find the shortest weighted path from β­•START to β­•END. This path is our trip route.


On the Belo Horizonte data (630 routes with 9,428 stops) with n=3 for indirect transfer generation we get a graph of 239,667 nodes and 4,879,935 edges. I haven't yet figured out a way to further reduce this edge count.

The transfer network generation takes about 20 minutes on my ThinkPad X260. I'm not certain a direct comparison is appropriate, but this is significantly faster than the 12 hours the other implementation took. The actual route generation doesn't seem any faster or slower than the other implementation.

While the routing is fairly fast, it definitely could be faster, and is not fast enough for direct usage in PolicySpace. We won't be able to compute individual trip routes for every one of Brazil's ~209 million inhabitants, but we were never expecting to either. This trip routing feature will be used by some higher-level interface which will cache routes and batch compute more approximate routes between regions (called "transportation analysis zones" in the transit demand modeling literature) to cut down on processing time. We're still figuring out specifics.

Anyway, this is version 1 of this trip routing method — I'd appreciate any feedback and suggestions on how to improve the approach. I need to do more testing to get a better sense of the generated routes' quality.

The code for this implementation at time of writing is available here. There are a couple names different in the code than described here, e.g. the transfer network is called a "trip network", but the algorithm itself should be the same.

ProtonMail Bridge & Mutt

ProtonMail recently released their Linux beta for Bridge, which provides IMAP/SMTP access to the service. Prior to Bridge you could only access the service through the web interface, which is sort of clunky and requires you to, among other things, rely on their search, which is limited by the fact that they can't really index your emails - because you're paying them not to read the message bodies!

ProtonMail provides instructions for setting up the Bridge with common email applications like Thunderbird, but that's about it. So here's how to set it up with NeoMutt and OfflineIMAP for fetching our emails.

(My full email setup also includes the common Mutt companions NotMuch for better searching and urlscan for viewing URLs more easily in emails, in addition to some custom scripts, such as one for viewing HTML emails in a nice popover window and one for viewing MHT/MHTML emails (which are emails that contain inline attachments). It's too much to cover here, but if you want to poke around these scripts and my full email configs (at time of writing), see my dippindots.)

Installing NeoMutt and OfflineIMAP

These instructions are for Ubuntu 16.04, but I imagine they aren't much different for other distributions (yours might even have a package you can install).

Install dependencies:

sudo apt install -y xsltproc libidn11-dev libsasl2-dev libnotmuch-dev --no-install-recommends

Then grab the latest NeoMutt release, extract, and build:

./configure --disable-doc --ssl --sasl --notmuch
sudo make install

# so we can just access it via `mutt`
sudo ln -s /usr/bin/neomutt /usr/bin/mutt

Then install OfflineIMAP:

sudo pip install offlineimap

Running the Bridge

The Bridge can be run from the command line with the Desktop-Bridge program. By default this opens a GUI to setup your accounts, but you can also access a console interface with Desktop-Bridge --cli.

If you aren't already logged in you need to run the login command in this interface.

Configuring OfflineIMAP

First thing to do is configure OfflineIMAP to access our ProtonMail emails.

OfflineIMAP looks for a config at ~/.offlineimaprc. My config at time of writing is:

accounts = main

[Account main]
localrepository = main-local
remoterepository = main-remote

# full refresh, in min
autorefresh = 0.2

# quick refreshs between each full refresh
quick = 10

# update notmuch index after sync
postsynchook = notmuch new

[Repository main-local]
type = Maildir
localfolders = ~/.mail

# delete remote mails that were deleted locally
sync_deletes = yes

[Repository main-remote]
type = IMAP
remoteport = 1143
remotehost =
remoteuser = <YOUR EMAIL>
keepalive = 60
holdconnectionopen = yes

# delete local mails that were deleted on the remote server
expunge = yes

# sync only these folders
folderfilter = lambda foldername: foldername in ['INBOX', 'Archive', 'Sent']

# is broken, but connecting locally to bridge so should be ok
ssl = no

Basically this sets up an account arbitrarily called main which will store emails at ~/.mail in the Maildir format. It will only sync the INBOX, Archive, and Sent folders to/from ProtonMail (the folderfilter option). Emails deleted locally will also be deleted on ProtonMail (the sync_deletes option) and emails deleted on ProtonMail will be deleted locally (the expunge option).

After OfflineIMAP fetches new email, it will run the command defined for postsynchook, which in this case is is the notmuch command for updating its search index (notmuch new).

Important Bridge-related things to note:

  • Bridge generates a Bridge-specific password for you to use, so use that here and not your actual ProtonMail password.
  • Bridge's IMAP service runs at (normally IMAP runs on port 143 or 993 for SSL)
  • Disable SSL because it was (at least when I set this up) not working with Bridge. But this seems like a non-issue because it's over a local connection anyways and the actual outgoing connection to ProtonMail is encrypted.

Then try running it using the offlineimap command.

Configuring NeoMutt

There is a lot to configure in NeoMutt, so I'll only cover what is necessary to get this setup working. If you're interested in seeing more, my NeoMutt config at time of writing is available here.

NeoMutt looks for a config at ~/.muttrc. To get it working with OfflineIMAP and to send emails with SMTP you need at least:

# "+" substitutes for `folder`
set mbox_type=Maildir
set folder=~/.mail/
set record=+Sent
set postponed=+Drafts
set trash=+Trash
set mail_check=2 # seconds

# smtp
source ~/docs/keys/mail
set smtp_url=smtp://$my_user:$my_pass@
set ssl_force_tls
set ssl_starttls

Where my ~/docs/keys/mail file has contents in the format:

set my_user=<YOUR EMAIL>

Important Bridge-related notes:

  • The SMTP port is 1025 (typically it's 587)
  • See the previous note on Bridge-specific password

That should be all you need.

"Daemonizing" the Bridge

There currently is no way to daemonize the Bridge, but here's a workaround using tmux:

tmux new-session -d -s mail 'Desktop-Bridge --cli'

This just opens up a new tmux session and runs the Bridge inside of it.

Finding closest edge positions in a spatial network

I encountered an interesting challenge the other day while working on a transit demand model for the PolicySpace project (under Bernardo Furtado at the IPEA/Institute of Applied Economic Research).

We have bus data for Belo Horizonte, one of the largest cities in Brazil (and South America). Specifically, we have GTFS (General Transit Feed Specification) data with geographic coordinates of bus stops.

We also have a network of the city's roads, with data provided by OpenStreetMaps and accessed with the osmnx library. The library provides us with the geographic and UTM (Universal Transverse Mercator) projected coordinates for nodes as well as line geometries for most of the edges (edges which are just straight lines don't have geometries provided, but they can be inferred from the coordinates of their start and end nodes).

How do we meaningfully map these bus stops to our road network? The first instinct might be to associate them with their closest node, but many bus stops start somewhere along a road rather than at the start or end of one.

So we need to be able to find the closest point along a road for a given bus stop. This is trickier because whereas nodes can be treated as single points, roads potentially have winding geometries. A road which starts and ends far from a bus stop might nevertheless have a very close point to the bus stop somewhere along it. We want to accommodate these kinds of cases.

Goal: find closest point on closest edge to a bus stop's coordinates
Goal: find closest point on closest edge to a bus stop's coordinates

The first step is to find candidate edges that are close enough to the bus stop to be plausible roads for it. We have bounding boxes for each edge's geometry, so we can accomplish this search with a quadtree, using Pyqtree. A quadtree is a data structure that indexes 2D boxes and allows you to quickly find boxes which overlap a given query box. It's sort of like a kd-tree but can search boxes in addition to points.

To demonstrate this process, I'll use a dummy road network consisting of a few paths:

bbox = (-0.3, -0.3, 1.4, 1.4)
paths = [
    [(0, 0), (0.2, 0.6), (0.5, 0.9), (1, 1)],
    [(0.8, 0.2), (0.5, 0.8), (0.5, 0.9)],
    [(0.4, 0.05), (0.5, 0.1), (0.6, 0.07), (0.65, 0)],
    [(0, 1.2), (0.2, 0.9), (0.2, 0.65), (0.3, 0.4), (0.3, 0.3)],

Which looks like:

Dummy road network
Dummy road network

So we index all our edges' bounding boxes into this quadtree. We pad the bounding boxes by some configurable radius, since the bus stop is likely to be in the vicinity of an edge's borders.

from pyqtree import Index
from shapely import geometry

# box padding
radius = 0.1

# keep track of lines so we can
# recover them later
lines = []

# initialize the quadtree index
idx = Index(bbox)

# add edge bounding boxes to the index
for i, path in enumerate(paths):
    # create line geometry
    line = geometry.LineString(path)

    # bounding boxes, with padding
    x1, y1, x2, y2 = line.bounds
    bounds = x1-radius, y1-radius, x2+radius, y2+radius

    # add to quadtree
    idx.insert(i, bounds)

    # save the line for later use
    lines.append((line, bounds))

Here's what these bounding boxes look like for the demo network:

Padded bounding boxes for edges
Padded bounding boxes for edges

Let's say we have a bus stop at the following red point:

Bus stop point
Bus stop point

To find the closest edges using the quadtree, we treat this point as a box, adding some padding to it as well:

# query point
pt = geometry.Point(0.6, 0.4)

# query box
pt_bounds = pt.x-radius, pt.y-radius, pt.x+radius, pt.y+radius
Bus stop box
Bus stop box

Querying the quadtree is simple:

matches = idx.intersect(pt_bounds)
Matched edges (in yellow)
Matched edges (in yellow)

Now it's a matter of finding the closest edge. We're using shapely's geometry objects which have a lot of what we need built-in:

# find closest path
closest_path = min(matches, key=lambda i: lines[i][0].distance(pt))
closest_line = lines[closest_path][0]
Closest edge (in red)
Closest edge (in red)

Once we have the closest edge, we find the closest point on that edge by projecting the query point onto the edge. We don't want an absolute position, but rather how far the point is along the edge. That is, we want a value in [0,1] such that 0 is the start of the edge, 1 is the end of the edge, 0.5 is halfway along the edge, etc. shapely's normalized=True argument will take care of that for us.

p = closest_line.project(pt, normalized=True)

If we want the absolute point to, for example, plot it below, we can get iteasily:

closest_pt = closest_line.interpolate(p, normalized=True)
Closest point (the red star)
Closest point (the red star)

Now we have mappings of bus stops to positions along edges, which we can leverage when computing trip routes for our transit demand model.

Three.js + Blender Game Dev

Yesterday I started working on a new game, tentatively titled "Trolley Problems: A Eulogy for Social Norms". This will be my second game with Three.js and Blender (though the first, The Founder, is still not totally finished) and I thought it'd be helpful to document this process.

The game isn't very far at this point. This video shows its current state:

The basic process was: 1) create the chair model in Blender and 2) load the chair into Three.js.


I won't give a Blender modeling tutorial here (this was suggested as a good resource to start), but the theater chair model was quickly made by extruding a bunch from a cube.

Theater chair in Blender
Theater chair in Blender

A lot of the model's character comes out in the texturing. I like to keep textures really simple, just combinations of solid colors. The way I do it is with a single texture file that I gradually add to throughout a project. Each pixel is a color, so the texture is basically a palette.

In Blender I just select the faces I want to have a particular color, use the default UV map unwrapping (while in Edit Mode, hit U and select Unwrap) and then in UV/Image Editor (with the texture file selected in the dropdown, see image below) I just drag the unwrapped faces onto the appropriate colors.

UV mapping/applying the texture. Each color block is one pixel
UV mapping/applying the texture. Each color block is one pixel

There is one thing you have to do to get this texture rendering properly. By default, Blender (like almost every other 3D program) will try to scale your textures.

In the texture properties window (select the red-and-white checker icon in the Properties window, see below), scroll down and expand the Image Sampling section. Uncheck MIP Map and Interpolation, then set Filter to Box (see below for the correct settings). This will stop Blender from trying to scale the texture and give you the nice solid color you want.

The texture properties window. Here's where you set what image to use for the texture
The texture properties window. Here's where you set what image to use for the texture
The texture image sampling settings
The texture image sampling settings

Here's a Blender render of the outcome:

Render of the theater chair
Render of the theater chair

Exporting to Three.js

One of the best things about Three.js is that there is a Blender-to-Three.js exporter. Installation is pretty straightforward (described in the repo here).

To export a Blender object to JSON, select your object (in Object Mode) and select from the menu File > Export > Three (.json).

The export options can be a bit finicky, depending on what you want to export the model for. For a basic static model like this chair the following settings should work fine:

Blender-to-Three.js export settings
Blender-to-Three.js export settings

That's all that Blender's needed for.


This next section involves a bunch of code. I won't reproduce everything here (you can check out the repo to see the full working example) but I'll highlight the important parts.

Loading the model

Three.js provides a JSONLoader class which is what loads the exported Blender model. You could just use that and be done with it, but there are a few modifications I make to the loaded model to get it looking right.

const meshLoader = new THREE.JSONLoader();
meshLoader.load('assets/theater_chair.json', function(geometry, materials) {
    // you could just directly use the materials loaded from JSON,
    // but I like to make sure I'm using the proper material.
    // we extract the texture from the loaded material and pass it to
    // the MeshLambertMaterial here
    var texture  = materials[0].map,
        material = new THREE.MeshLambertMaterial({map: texture}),
        mesh = new THREE.Mesh(geometry, material);

    // material adjustments to get the right look
    material.shininess = 0;
    material.shading = THREE.FlatShading;

    // basically what we did with blender to prevent texture scaling = false; = THREE.NearestFilter; = THREE.NearestFilter;

    // increase saturation/brightness
    material.emissiveIntensity = 1;
    material.emissive = {r: 0, g: 0, b: 0};
    material.color = {
        r: 2.5,
        g: 2.5,
        b: 2.5


The above code won't work until we create the scene. I like to bundle the scene-related stuff together into a Scene class:

class Scene {
    constructor(lightColor, groundColor, skyColor, fogColor, fogDistance) { = new THREE.PerspectiveCamera(75, window.innerWidth / window.innerHeight, 1, 1000);
        this.scene = new THREE.Scene();
        this.scene.fog = new THREE.Fog(fogColor, 0, fogDistance);
        var light = new THREE.HemisphereLight(lightColor, 0x777788, 0.75);
        light.position.set(0.5, 1, 0.75);

        // setup floor
        var geometry = new THREE.PlaneGeometry(2000, 2000, 100, 100);
        var material = new THREE.MeshLambertMaterial({color: groundColor});
        var mesh = new THREE.Mesh(geometry, material);

        this.renderer = new THREE.WebGLRenderer();
        this.renderer.setSize(window.innerWidth, window.innerHeight);

        window.addEventListener('resize', this.onWindowResize.bind(this), false);

    render() {

    add(mesh) {

    onWindowResize() { = window.innerWidth/window.innerHeight;;
        this.renderer.setSize(window.innerWidth, window.innerHeight);

It can be used like this:

var scene = new Scene();

And the previous code for loading the chair should place the chair into the scene.

First-person interaction

So far you won't be able to look or walk around the scene, so we need to add some first-person interaction. There are two components to this:

  1. movement (when you press WASD or the arrow keys, you move forward/backwards/left/right)
  2. looking around (when you move the mouse)


In the Scene class the onKeyDown and onKeyUp methods determine, based on what keys you press and release, which direction you should move in. The render method includes some additional code to check which directions you're supposed to be moving in and adds the appropriate velocity.

The velocity x value moves you right (positive) and negative (negative), the y value moves you up (positive) and down (negative), and the z value moves you forward (negative) and backwards (positive). It's important to note that the z value is negative in the forward direction because this confused me for a while.

We also keep track of how much time elapsed since the last frame (delta) so we scale the velocity appropriately (e.g. if the last frame update was 0.5s ago, you should move only half as far as you would if it had been 1s ago).

You'll notice that you can walk through objects which is probably not what you want...we'll add simple collision detection later to fix this.

Looking around

The key to looking around is the browser's pointer lock API. The pointer lock API allows you to capture your mouse cursor and its movement.

I'd never done this before, but the Three.js repo includes an example that shows the basic method. So I gutted that and moved the important bits into the Scene class. The full code is available here), but I'll explain some of it here.

In the Scene class the important method is setupPointerLock, which sets up the pointer lock event listeners. It is pretty straightforward, but basically: there's an instructions element that, when clicked on, locks the pointer and puts the game into fullscreen.

The onPointerLockChange method toggles the pointer lock controls, so that the controls are only listening when the pointer lock is engaged.

The meat of the actual pointer movement is handled in PointerLock.js. This is directly lifted from the Three.js example implementation. It's also pretty sparse; it adjusts the yaw and pitch of the camera according to how you move your mouse.

We have to properly initialize these controls though. In the Scene's constructor the following are added:

    // ...
    this.controls = new THREE.PointerLockControls(;
    // ...

Collision detection

So the last bit here is to prevent the player from walking through stuff. I have a terrible intuition about 3D graphics so this took me way too long. Below are some of my scribbles from trying to understand the problem.


The basic approach is to use raycasting, which involves "casting" a "ray" out from a point in some direction. Then you can check if this ray intersects with any objects and how far away those objects are.

Below are an example of some rays. For example, the one going in front of the object points to (0,0,1). That sounds like it contradicts what I said earlier about the front of the object being in the negative-z doesn't quite. This will become important and confusing later.

Some of the casted rays (in blue)
Some of the casted rays (in blue)

The rays that I'm using are:

const RAYS = [
  new THREE.Vector3( 0,  0, -1), // back
  new THREE.Vector3( 1,  0, -1), // back-right
  new THREE.Vector3(-1,  0, -1), // back-left
  new THREE.Vector3( 0,  0,  1), // forward
  new THREE.Vector3( 1,  0,  1), // forward-right
  new THREE.Vector3(-1,  0,  1), // forward-left
  new THREE.Vector3( 1,  0,  0), // right
  new THREE.Vector3(-1,  0,  0), // left
  new THREE.Vector3( 0, -1,  0), // bottom

Note that the comments in the example on GitHub are incorrect (they have right and left I said, this was very confusing for me).

Every update we cast these rays and see if they intersect with any objects. We check if those objects are within some collision distance, and if they are, we zero out the velocity in that direction. So, for instance, if you're trying to move in the negative-z direction (forward) and there's an object in front of you, we have to set velocity.z = 0 to stop you moving in that direction.

That sounds pretty straightforward but there's one catch - the velocity is relative to where you're facing (i.e. the player's axis), e.g. if velocity.z is -1 that means you're moving in the direction you're looking at, which might not be the "true" world negative-z direction.

These rays, unlike velocity, are cast in directions relative to the world axis.

This might be clearer with an example.

Say you're facing in the positive-x direction (i.e. to the right). When you move forward (i.e. press W), velocity.z will be some negative value and velocity.x will be zero (we'll say your velocity is (0,0,-1)). This is because according to your orientation, "forward" is always negative-z, even though in the context of the world your "forward" is technically positive-x. Your positive-x (your right) is in the world's negative-z direction (see how this is confusing??).

Now let's say an object is in front of you. Because our raycasters work based on the world context, it will say there's an obstacle in the positive-x direction. We want to zero-out the velocity in that direction (so you're blocked from moving in that direction). However, we can't just zero-out velocity.x because that does not actually correspond to the direction you're moving in. In this particular example we need to zero-out velocity.z because, from your perspective, the obstacle is in front of you (negative-z direction), not to your right (positive-x direction).

Orientation problem example
Orientation problem example

The general approach I took (and I'm not sure if it's particularly robust, but it seems ok for now) is to take your ("local") velocity, translate it to the world context (i.e. from our example, a velocity of (0,0,-1) gets translated to (1,0,0)). Then I check the raycasters, apply the velocity zeroing-out to this transformed ("world") velocity (so if there is an obstacle in the positive-x direction, I zero out the x value to get a world velocity of (0,0,0)), then re-translate it back into the local velocity.

Ok, so here's how this ends up getting implemented.

First I add the following initialization code to the Scene's constructor:

    // ...
    this.height = 10;
    this.collidable = [];
    this.collisionDistance = 15;
    this.raycaster = new THREE.Raycaster(new THREE.Vector3(), new THREE.Vector3(0, -1, 0), 0, this.height);
    // ...

Whenever you add a mesh and the player shouldn't be able to walk through it, you need to add that mesh to this.collidable.

Then I add a detectCollisions method onto the Scene class:

detectCollisions() {
    // the quaternion describes the rotation of the player
    var quaternion = this.controls.getObject().quaternion,
        adjVel = this.velocity.clone();

    // don't forget to flip that z-axis so that forward becomes positive again
    adjVel.z *= -1;

    // we detect collisions about halfway up the player's height
    // if an object is positioned below or above this height (and is too small to cross it)
    // it will NOT be collided with
    var pos = this.controls.getObject().position.clone();
    pos.y -= this.height/2;

    // to get the world velocity, apply the _inverse_ of the player's rotation
    // effectively "undoing" it
    var worldVelocity = adjVel.applyQuaternion(quaternion.clone().inverse());

    // then, for each ray
    _.each(RAYS, (ray, i) => {
      // set the raycaster to start from the player and go in the direction of the ray
      this.raycaster.set(pos, ray);

      // check if we collide with anything
      var collisions = this.raycaster.intersectObjects(this.collidable);
      if (collisions.length > 0 && collisions[0].distance <= this.collisionDistance) {
        switch (i) {
          case 0:
            // console.log('object in true back');
            if (worldVelocity.z > 0) worldVelocity.z = 0;
          case 1:
            // console.log('object in true back-left');
            if (worldVelocity.z > 0) worldVelocity.z = 0;
            if (worldVelocity.x > 0) worldVelocity.x = 0;
          case 2:
            // console.log('object in true back-right');
            if (worldVelocity.z > 0) worldVelocity.z = 0;
            if (worldVelocity.x < 0) worldVelocity.x = 0;
          case 3:
            // console.log('object in true front');
            if (worldVelocity.z < 0) worldVelocity.z = 0;
          case 4:
            // console.log('object in true front-left');
            if (worldVelocity.z < 0) worldVelocity.z = 0;
            if (worldVelocity.x > 0) worldVelocity.x = 0;
          case 5:
            // console.log('object in true front-right');
            if (worldVelocity.z < 0) worldVelocity.z = 0;
            if (worldVelocity.x < 0) worldVelocity.x = 0;
          case 6:
            // console.log('object in true left');
            if (worldVelocity.x > 0) worldVelocity.x = 0;
          case 7:
            // console.log('object in true right');
            if (worldVelocity.x < 0) worldVelocity.x = 0;
          case 8:
            // console.log('object in true bottom');
            if (worldVelocity.y < 0) worldVelocity.y = 0;

    // re-apply the player's rotation and re-flip the z-axis
    // so the velocity is relative to the player again
    this.velocity = worldVelocity.applyQuaternion(quaternion);
    this.velocity.z *= -1;

This is working for me so far. The code can probably be more concise and I suspect there's a much more direct route (involving matrix multiplication or something) to accomplish this. But I can wrap my head around this approach and it makes sense :)

Simulating a multi-node (Py)Spark cluster using Docker


I'm working on a set of tools for the Coral Project to make building data analysis pipelines easy and, perhaps one day, accessible to even non-technical folks. Part of what will be offered is a way of easily toggling between running pipelines on a in parallel on a local machine or on a distributed computing cluster. That way, the pipelines that a small organization uses for their data can be adapted to a larger organization just by spinning up the setup described below and changing a configuration option.

I wanted to simulate a multi-node cluster for developing these tools, and couldn't find any guides for doing so. So after some research, here is one.

The setup that follows runs all on one machine (remember, it just simulates a multi-node cluster), but it should be easily adaptable to a real multi-node cluster by appropriately changing the IPs that the containers use the communicate.

I have made available a repo with the Dockerfiles and scripts described below.

The Stack

A lot goes into the cluster stack:

  • Spark - used to define tasks
  • Mesos - used for cluster management
  • Zookeeper - used for electing Mesos leaders
  • Hadoop - used for HDFS (Hadoop Distributed File System)
  • Docker - for containerizing the above

The Setup

There will be a client machine (or "control node"), which is the machine we're working from. In this walkthrough, the client machine also functions as the Docker host (where the Docker containers are run).

Docker containers are spun up for each other part of the stack, and they all communicate via their "external" Docker IPs.

Setting up the client

I'm assuming a Linux environment because that's what Docker works best with (on OSX you are probably running it in a Linux VM anyways). The following instructions are for Ubuntu but should be replicable on other distros.

The client needs to have Spark and Mesos installed to properly interact with the cluster.

Spark has precompiled binaries available on their downloads page which are easily installed:

# go to <>
# select and download the version you want
tar -xzvf spark-*.tgz
rm spark-*.tgz
sudo mv spark* /usr/local/share/spark

Add the following to your ~/.bash_profile as well:

export SPARK_HOME=/usr/local/share/spark

# so pyspark can be imported in python

PySpark has one final requirement, the py4j library:

pip install py4j

Mesos does not have any precompiled binaries, so you must compile it yourself:


# sources available at <>
tar -zxf mesos-*.tar.gz
rm mesos-*.tar.gz

# dependencies
sudo apt-get install -y openjdk-7-jdk build-essential python-dev python-boto libcurl4-nss-dev libsasl2-dev maven libapr1-dev libsvn-dev

# by default, this installs to /usr/local
cd mesos*
mkdir build
cd build
sudo make install

Finally, we need to configure Spark to use a Mesos cluster:

cp $SPARK_HOME/conf/ $SPARK_HOME/conf/
echo 'export MESOS_NATIVE_JAVA_LIBRARY=/usr/local/lib/' >> $SPARK_HOME/conf/

That's all for the client.

Setting up the Docker images

Our "cluster" will consist of several Docker containers, with one (or more) for each part of the stack, so we create images for each.


The Zookeeper image is straightforward:

FROM ubuntu:14.04

ENV ZOOKEEPER_PATH /usr/local/share/zookeeper

# update
RUN apt-get update
RUN apt-get upgrade -y

# dependencies
RUN apt-get install -y wget openjdk-7-jre-headless

# zookeeper
RUN wget${ZOOKEEPER_V}/zookeeper-${ZOOKEEPER_V}.tar.gz
RUN tar -zxf zookeeper-*.tar.gz
RUN rm zookeeper-*.tar.gz
RUN mv zookeeper-* $ZOOKEEPER_PATH
RUN mv $ZOOKEEPER_PATH/conf/zoo_sample.cfg $ZOOKEEPER_PATH/conf/zoo.cfg



CMD ["start-foreground"]

A Zookeeper binary is downloaded and installed, then the default config is copied over. We start the Zookeeper service in the foreground so the Docker container does not immediately exit.


The Hadoop image is more involved:

FROM ubuntu:14.04

ENV HADOOP_HOME /usr/local/hadoop
ENV JAVA_HOME /usr/lib/jvm/java-7-openjdk-amd64
ENV HADOOP_TMP /var/hadoop/tmp

# update
RUN apt-get update
RUN apt-get upgrade -y

# dependencies
RUN apt-get install -y openssh-server openjdk-7-jdk wget

# disable ipv6 since hadoop does not support it
RUN echo 'net.ipv6.conf.all.disable_ipv6 = 1' >> /etc/sysctl.conf
RUN echo 'net.ipv6.conf.default.disable_ipv6 = 1' >> /etc/sysctl.conf
RUN echo 'net.ipv6.conf.lo.disable_ipv6 = 1' >> /etc/sysctl.conf

# hadoop
RUN wget${HADOOP_V}/hadoop-${HADOOP_V}.tar.gz
RUN tar -zxf hadoop-*.tar.gz
RUN rm hadoop-*.tar.gz
RUN mv hadoop-* $HADOOP_HOME

# hadoop tmp directory
RUN mkdir -p $HADOOP_TMP
RUN chmod 750 $HADOOP_TMP

# configs
RUN echo "export JAVA_HOME=$JAVA_HOME" >> $HADOOP_HOME/etc/hadoop/
ADD docker/assets/core-site.xml $HADOOP_HOME/etc/hadoop/core-site.xml

# auth
# the provided config saves us from having
# to accept each new host key on connect
RUN ssh-keygen -q -N "" -t rsa -f /root/.ssh/id_rsa
RUN cp /root/.ssh/ /root/.ssh/authorized_keys
ADD docker/assets/ssh_config /root/.ssh/config

# format the hdfs
RUN hdfs namenode -format

ADD docker/assets/start_hadoop start_hadoop

EXPOSE 8020 50010 50020 50070 50075 50090

CMD ["-d"]
ENTRYPOINT ["./start_hadoop"]

It does the following:

  • A Hadoop binary is downloaded and installed
  • IPV6 is disabled because Hadoop does not support it
  • SSH auth is setup because Hadoop uses it for connections
  • Hadoop is configured with the proper Java install

For SSH, a config which frees us from having to manually accept new hosts is copied over:

Host *
    UserKnownHostsFile /dev/null
    StrictHostKeyChecking no

A core-site.xml config file is also added, which includes:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
  <description>A base for other temporary directories.</description>

  <description>The name of the default file system.  A URI whose
  scheme and authority determine the FileSystem implementation.  The
  uri's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class.  The uri's authority is used to
  determine the host, port, etc. for a filesystem.</description>

The important part here is the fs.defaultFS property which describes how others can access the HDFS. Here, the value is localhost, but that is replaced by the start_hadoop script (see below) with the container's "external" IP.

And finally, a start_hadoop script is copied over, which includes:


# get "external" docker ip
HDFS_IP=$(ifconfig eth0 | grep 'inet addr:' | cut -d: -f2 | awk '{print $1}')

# set the proper ip in the HDFS config
sed -i 's/localhost/'${HDFS_IP}'/g' $HADOOP_HOME/etc/hadoop/core-site.xml

/etc/init.d/ssh restart

if [[ $1 == "-d" ]]; then
    while true; do sleep 1000; done

if [[ $1 == "-bash" ]]; then

As mentioned, it replaces the localhost value in the core-site.xml config with the "external" IP so that others can connect to the HDFS.

It also starts the SSH service, then starts the HDFS, and, with the -d flag (which is passed in the above Dockerfile), emulates a foreground service so that the Docker container does not exit.


For the Mesos leader and followers, we first create a base Mesos image and then use that to create the leader and follower images.

The base Mesos image Dockerfile:

FROM ubuntu:14.04

ENV MESOS_V 0.24.0

# update
RUN apt-get update
RUN apt-get upgrade -y

# dependencies
RUN apt-get install -y wget openjdk-7-jdk build-essential python-dev python-boto libcurl4-nss-dev libsasl2-dev maven libapr1-dev libsvn-dev

# mesos
RUN wget${MESOS_V}/mesos-${MESOS_V}.tar.gz
RUN tar -zxf mesos-*.tar.gz
RUN rm mesos-*.tar.gz
RUN mv mesos-* mesos
RUN mkdir build
RUN ./configure
RUN make
RUN make install

RUN ldconfig

This just builds and installs Mesos.

The leader Dockerfile:

FROM mesos_base
ADD docker/assets/start_leader start_leader
ENTRYPOINT ["./start_leader"]

It exposes the Mesos leader port and copies over a start_leader script, which contains:


# get "external" docker IP
LEADER_IP=$(ifconfig eth0 | grep 'inet addr:' | cut -d: -f2 | awk '{print $1}')
mesos-master --registry=in_memory --ip=${LEADER_IP} --zk=zk://${ZOOKEEPER}/mesos

All this does is tell the leader to use its "external" IP, which is necessary so that the Mesos followers and the client can properly communicate with it.

It also requires a ZOOKEEPER env variable to be set; it is specified when the Docker container is run (see below).

The follower Dockerfile:

FROM mesos_base

ADD docker/assets/start_follower start_follower


# permissions fix

# use python3 for pyspark
RUN apt-get install python3
ENV PYSPARK_PYTHON /usr/bin/python3

ENTRYPOINT ["./start_follower"]

There is a bit more going on here. The Mesos follower port is exposed and a few env variables are set. The MESOS_SWITCH_USER variable is a fix for a permissions issue, and the PYSPARK_PYTHON lets Spark know that we will use Python 3.

Like the leader image, there is a start_follower script here, which is simple:

mesos-slave --master=zk://${ZOOKEEPER}/mesos

Again, it uses a ZOOKEEPER env variable which is specified when the container is run.

Building the images

Finally, we can build all the images:

sudo docker build -f Dockerfile.mesos -t mesos_base .
sudo docker build -f Dockerfile.follower -t mesos_follower .
sudo docker build -f Dockerfile.leader -t mesos_leader .
sudo docker build -f Dockerfile.zookeeper -t mesos_zookeeper .
sudo docker build -f Dockerfile.hadoop -t hadoop .

Running the cluster

With all the images built, we can start the necessary Docker containers.

First, start a Zookeeper container:

sudo docker run --name mesos_zookeeper -itP mesos_zookeeper

When it's running, make a note of its IP:

ZOOKEEPER_IP=$(sudo docker inspect --format '{{.NetworkSettings.IPAddress }}' $(sudo docker ps -aq --filter=name=mesos_zookeeper))

Then, start the Hadoop container:

sudo docker run --name hadoop -itP hadoop

Note that our container name here should not have underscores in it, because Java can't handle hostnames with underscores.

Then, start a Mesos leader container:

sudo docker run -e ZOOKEEPER=${ZOOKEEPER_IP}:2181 --name mesos_leader -itP mesos_leader

Note that we set the ZOOKEEPER env variable here.

Finally, start a Mesos follower container in the same fashion:

sudo docker run -e ZOOKEEPER=${ZOOKEEPER_IP}:2181 --name mesos_follower -itP mesos_follower

Using the cluster

With the client setup and the cluster containers running, we can start using PySpark from the client machine.

We'll do the classic word count example to demonstrate the process.

First, open a shell in the Hadoop container:

sudo docker exec -it hadoop bash

From this container, grab a text file to work with and put it in the HDFS so the Mesos followers can access it:

hadoop fs -put pg4300.txt /sample.txt

Now, back in the client machine, we can put together a Python script to count the words in this file.

First, we need to know the Zookeeper host, so PySpark knows where to find the cluster, and the Hadoop IP, so PySpark knows where to grab the file from. We'll pass them in as command-line arguments and grab them using the sys library:

import sys
import pyspark

zookeeper = sys.argv[1]
hadoop_ip = sys.argv[2]

Then we can specify where to find the text:

src = 'hdfs://{}:8020/sample.txt'.format(hadoop_ip)

And configure PySpark:

conf = pyspark.SparkConf()

One important configuration option is spark.executor.uri, which tells Mesos followers where they can get the Spark binary to properly execute the tasks. This must be a prebuilt Spark archive, i.e. a Spark binary package. You can build it and host it yourself if you like.

conf.set('spark.executor.uri', '')

Then we can create the SparkContext with our config and define the task:

sc = pyspark.SparkContext(conf=conf)

lines = sc.textFile(src)
words = lines.flatMap(lambda x: x.split(' '))
word_count = ( x: (x, 1)).reduceByKey(lambda x, y: x+y))

Save this file as

There are a couple gotchas when running this script.

We cannot run it with a simple python If we do so, then PySpark will use the client's local IP, e.g. something like We want PySpark to use the client's Docker IP so that it can properly communicate with the other Docker containers, and specify this as an env variable called LIBPROCESS_IP:

export LIBPROCESS_IP=$(ifconfig docker0 | grep 'inet addr:' | cut -d: -f2 | awk '{print $1}')

Then, we must also specify the proper Python version for the client's Spark install:

export PYSPARK_PYTHON=/usr/bin/python3

Because we're also passing in the Zookeeper connection string and the Hadoop IP, let's get those too:

HADOOP_IP=$(sudo docker inspect --format '{{.NetworkSettings.IPAddress }}' $(sudo docker ps -aq --filter=name=hadoop))

And now we can run the script:


Multi-node/high-availability setup

So far we only have one follower, but to better emulate a multi-node setup, we want many followers. This is easy to do, just spin up more follower Docker containers with the proper ZOOKEEPER variable:

sudo docker run -e ZOOKEEPER=${ZOOKEEPER_IP}:2181 --name mesos_follower1 -itP mesos_follower
sudo docker run -e ZOOKEEPER=${ZOOKEEPER_IP}:2181 --name mesos_follower2 -itP mesos_follower
sudo docker run -e ZOOKEEPER=${ZOOKEEPER_IP}:2181 --name mesos_follower3 -itP mesos_follower
# etc

For a high-availability setup, we can also create many leaders in a similar way:

sudo docker run -e ZOOKEEPER=${ZOOKEEPER_IP}:2181 --name mesos_leader1 -itP mesos_leader
sudo docker run -e ZOOKEEPER=${ZOOKEEPER_IP}:2181 --name mesos_leader2 -itP mesos_leader
sudo docker run -e ZOOKEEPER=${ZOOKEEPER_IP}:2181 --name mesos_leader3 -itP mesos_leader
# etc

These leaders will all register with Zookeeper and Zookeeper will elect one to be the "active" leader. The followers will coordinate with Zookeeper to figure out which leader they should be talking to. If one leader goes down, Zookeeper will elect a new active leader in its place.

We can even have multiple Zookeeper containers, but I haven't yet tried it out.


This repo has all of the files mentioned with a script that makes it easy to spin up this entire setup.