Culture: A Social Network Simulator

This is a proposal for Culture, a social network simulator designed and developed to teach students about bot development.

This proposal was originally developed for a class on "news bots" I was scheduled to teach in the fall of 2017 (I ended up having a conflict and was unable to teach it). I wanted students to not only explore the impact of bots from a theory perspective, but also engage hands-on to see just how radically influential these bots are on social media platforms.

And not only bots. Ideally students would take on the role of other actors in social media ecosystems, such as a "traditional" media publication, an advertiser, a political candidate, an influencer, or even the platform itself, making decisions about aspects such as the newsfeed algorithm.

Unfortunately, there are a number of challenges that make hands-on experience infeasible with live social networks:

  • Ethical concerns. For example, many bots are meant to deceive and manipulate, and we'd be working with real user data.
  • Issues of access. For example, rate-limiting and limited access to data. For privacy reasons APIs generally don't provide sensitive user data to developers, though some such data may be provided to advertisers. And of course, with live social networks there isn't a way for students to change the newsfeed algorithms for the entire network.
  • Limits of reality. For example, a student can't magically become an influencer on Twitter, but in a simulated setting, they can.

There are also some technical obstacles, namely that students taking the class weren't required to have any programming background and I didn't want to spend too much time on introductory programming lessons. Even if students were fairly experienced in programming, working with bots has a lot of advanced challenges, such as dealing with natural language. A simulated social network can be simplified so that these problems are easier to deal with.

This proposal doesn't really have a strong advertising component. After speaking with Irwin Chen about it, I realized it's a pretty big omission. So an updated proposal will include all of that: selling ads, ad targeting, ad exchanges, etc. It's not an area I know well, so I'd have to speak with some people and do some research before sketching that out.


Culture will be an agent-based simulation of a simple social network modeled off of Twitter. As such, the simulation will consist of the following (each part is elaborated further below):

  • users communicate in a rudimentary language
  • users have different personalities
  • each user will have a feed of messages from people they follow, which also includes promoted/ad messages
    • here students can potentially design their own news feed algorithms and see how that affects individual/public opinion
  • users can message, post media, block, be blocked, be banned, follow, unfollow
  • messages and media influence users
  • the network responds to and affects outside events


So much of our exposure to and understanding of the world beyond our immediate experience is mediated by social networks, which is to say by newsfeed algorithms and other individual users of these networks. Students should develop a stronger literacy in these dynamics if they are to adequately navigate this information ecology.

This literacy is best developed by direct interaction with these social networks, such as Twitter or Facebook, rather than through theory alone. However, working directly with these networks may be impractical in that they are massive, closed-source, and limited in access. For instance, due to API limits it is impossible to survey or conduct analyses of the entire population of the network, or to examine in detail its inner operations.

Furthermore, there is no room for counterfactual speculation in these existing social networks. For example, we can't intervene and change the behaviors of all users and see how information propagation changes as a result. This limits the pedagogical value of working directly with, for example, Twitter or Facebook.

A simulated social network addresses these concerns. It can be designed to model the dynamics of its real counterparts, it can be entirely open in that students have access to all the network's data, and its parameters can be tweaked to see how information propagation evolves under different circumstances. Students can develop bots on this network without worrying about API limits, spam protection, and so on. In contrast to the black-box nature of a real social network, a simulated social network functions more like a sandbox.


The simulated agents are individual users of the social network. They are randomly generated to have particular personalities and interests (see below). Their generation is part of the simulation's initialization. Students do not directly interact with these agents, but can indirectly interact with them via, for example, ads and bots they create (see below).


Dealing with natural language is difficult even for experienced developers and advanced researchers in the topic. Broadly, the problem of natural language in the context of bots can be described in two parts: understanding and generation. Both are very difficult and beyond the scope of the courses that this simulation is designed for, which include introductory classes.

To avoid dealing with natural language, the simulation will use a language with a very basic grammar and a relatively small vocabulary which can be easily expanded as needed. Because of its relative simplicity, the same natural language processing techniques that are currently used for "real" languages can also be applied, but with greater success, and better yet, simpler heuristics will go a longer way. Thus students will not need to have a deep understanding of, for example, word vectors or TF-IDF, but may develop their own simpler techniques that will still be effective.

This simpler language will consist of verbs, nouns, and modifiers (adjectives and adverbs) (collectively, "terms"). Because the courses are assumed to be taught in English, this language will be reflective of English.

These terms are combined into formal propositional statements, e.g. single-payer-healthcare + country -> < freedom, which expresses the opinion that implementing single payer health care in this country will cause (->) a loss (<) of freedom. (This is just a sketch of the syntax; it's subject to change).
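To make the syntax sketch above a little more concrete, here is a minimal parser for statements of that shape. It assumes only the symbols used in the example (+, -> and <), plus a > for a gain, which is an assumed addition. The Proposition type and parse_proposition function are hypothetical names, and the real syntax may well end up looking different.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Proposition:
    causes: Tuple[str, ...]  # terms joined by "+"
    direction: str           # "<" = loss; ">" = gain (assumed symbol)
    effect: str              # the term being affected


def parse_proposition(text: str) -> Proposition:
    """Parse a statement shaped like 'a + b -> < c'."""
    left, arrow, right = text.partition("->")
    if not arrow:
        raise ValueError("missing '->'")
    causes = tuple(term.strip() for term in left.split("+") if term.strip())
    right = right.strip()
    if not causes or not right or right[0] not in "<>":
        raise ValueError("expected 'terms -> < effect' or 'terms -> > effect'")
    effect = right[1:].strip()
    if not effect:
        raise ValueError("missing effect term")
    return Proposition(causes, right[0], effect)


statement = parse_proposition("single-payer-healthcare + country -> < freedom")
print(statement)
```

Because the grammar is this small, a bot could parse every message it sees without any of the usual natural language tooling.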

This is a bit limiting (there is no room for poetics, for instance), but it will provide a strong starting point that can be expanded on later.

The design of this language will involve developing a network of terms (i.e. defining term associations), such that terms represent mixtures of other terms and values in the simulation (e.g. individuality/collectivism, see "Personalities" below). This term association network is opaque to the students; they do not get to see what these terms mean to the agents in the simulation. As with the real world, they must use algorithms or their own intuition from observing the network to determine what language best communicates their messages.

For example: the term "car" may be connected to the terms "individuality" and "freedom" to establish that the term "car" symbolically evokes these two ideas. We could then imagine ads for "cars" appeal more to agents with personalities that align more with those concepts relative to agents who, for example, align more with "collectivity" and "freedom".

Terms also have sentiment valences, e.g. "bad" may have a valence of -0.5 to express a negative opinion, whereas "terrible" may have a stronger valence of -0.8, and so on.
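As a rough illustration of how associations and valences might be stored and used, here is a small sketch. The association weights, the agent value profiles, and the resonance and message_sentiment functions are all made up for the example; only the "car" associations and the two valences come from the text above.

```python
# Hypothetical association weights: how strongly a term evokes each value.
ASSOCIATIONS = {
    "car": {"individuality": 0.6, "freedom": 0.4},
}

# Sentiment valences, using the two examples from the text.
VALENCE = {"bad": -0.5, "terrible": -0.8}


def resonance(term, agent_values):
    """Weighted overlap between a term's associations and an agent's values."""
    weights = ASSOCIATIONS.get(term, {})
    return sum(w * agent_values.get(value, 0.0) for value, w in weights.items())


def message_sentiment(terms):
    """Mean valence of the terms that carry one (0.0 if none do)."""
    valences = [VALENCE[t] for t in terms if t in VALENCE]
    return sum(valences) / len(valences) if valences else 0.0


# Two made-up agents with different value profiles.
individualist = {"individuality": 0.9, "freedom": 0.7, "collectivity": 0.1}
collectivist = {"individuality": 0.1, "freedom": 0.7, "collectivity": 0.9}

print(resonance("car", individualist), resonance("car", collectivist))
print(message_sentiment(["car", "terrible"]))
```

In the simulation itself the ASSOCIATIONS table would stay hidden from students; they would only see its effects.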

Ideally, this term association network is not objective but rather subjective, i.e. it differs depending on the particular agent. For example, the term "freedom" may be associated with different values for one agent than for another. However, it is likely that this will be computationally infeasible (though some kind of heuristics could be developed to simplify it).

This term association network also changes over time as terms are used in slightly different contexts. This provides a way for the meaning of terms to change or be entirely inverted, e.g. a negative term being co-opted as a positive identifying term for a group.

The language is the part of the simulation that will require most care in designing - it needs to represent important aspects of how language is used in social networks (e.g. to express opinion/judgement, to harass/abuse, to make propositional statements, etc).


Simulated agents will have what could loosely be described as "personalities"; that is, a set of parameters that determines how the agent interacts with others (e.g. aggressiveness/friendliness, within-bubble/outside-bubble, etc) and what their values are (e.g. conservative/progressive, individualist/collectivist, etc). These personalities will be generated randomly, via a Bayes Net (or some similar probabilistic model) that will be editable in some way. A model like a Bayes Net lets us describe assumed relationships between values (e.g. more collectivist agents are more likely to be friendly).
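A two-node version of such a model is easy to sketch: sample whether an agent is collectivist, then sample friendliness conditioned on that. The probabilities below are placeholders chosen for illustration, not proposed values, and sample_personality is a hypothetical name.

```python
import random

# Placeholder probabilities for a two-node net: collectivist -> friendly.
P_COLLECTIVIST = 0.5
P_FRIENDLY_GIVEN_COLLECTIVIST = {True: 0.8, False: 0.4}


def sample_personality(rng):
    """Sample the parent node first, then the child conditioned on it."""
    collectivist = rng.random() < P_COLLECTIVIST
    friendly = rng.random() < P_FRIENDLY_GIVEN_COLLECTIVIST[collectivist]
    return {"collectivist": collectivist, "friendly": friendly}


def friendly_rate(agents, collectivist):
    """Share of friendly agents within one group."""
    group = [a for a in agents if a["collectivist"] == collectivist]
    return sum(a["friendly"] for a in group) / len(group)


rng = random.Random(0)  # seeded so a run can be repeated
agents = [sample_personality(rng) for _ in range(10_000)]
print(friendly_rate(agents, True), friendly_rate(agents, False))
```

Making the table editable would then amount to exposing those probabilities, and a fuller model would add more nodes and continuous traits.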

These personalities also determine who agents tend to interact with (under principles of homophily, i.e. like attracts like) and also what kind of messaging resonates with them (e.g. messages about rugged individuality will resonate more with individualist agents).


"Messages" are the equivalent of Twitter's tweets. Agents compose their own messages based on their personalities and who they are interacting with. Messages may affect an agent's mood and also their personality (see below).


Text is not the only important part of a social network - memes and other media (news stories, videos, etc) form a crucial part of their information flow.


Agents' states include their personalities, in addition to other attributes like mood and use frequency (how often they visit the social network) and post frequency (how often they post messages). Mood may affect, for example, how agents interact with other agents (e.g. with more or less hostility). This can be used to model emotion contagion.


Based on who they interact with and what other messaging they are exposed to (e.g. targeted ads), the personalities (traits/opinions) of an agent may shift over time. Various social phenomena, e.g. bipolarization, can be modeled here.
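One simple rule that could drive this kind of shift is a bounded-confidence update, in the spirit of Deffuant-style opinion dynamics models: an agent moves toward opinions that are close to its own and ignores ones that are too far away. This is a stand-in technique rather than a settled design, and the openness and rate parameters are made-up values.

```python
def shift_opinion(own, other, openness=0.5, rate=0.1):
    """Move `own` a fraction of the way toward `other` if it is close enough.

    Opinions are assumed to live on a single axis, e.g. -1.0 to 1.0.
    """
    if abs(other - own) > openness:
        return own  # too far apart: the message has no effect
    return own + rate * (other - own)


# An agent at 0.0 repeatedly exposed to a nearby opinion drifts toward it...
opinion = 0.0
for _ in range(10):
    opinion = shift_opinion(opinion, 0.4)
print(round(opinion, 4))

# ...but is unmoved by an opinion outside its openness window.
print(shift_opinion(0.0, 0.9))
```

Run over a whole population, rules like this tend to produce separate opinion clusters, which is one way the polarization mentioned above could emerge.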


Social networks are not closed systems; they do not exist in isolation. The "outside" world affects what goes on in the network, just as what goes on in the network can spill out and affect the outside world.

Part of the simulation will support external events (also simulated) that affect and can be affected by the social network, such as an election. The outside event(s) affect what are popular topics (i.e. topics that are relevant and agents are more likely to talk about and respond to) and they can be defined to have some relationship to the shape of discourse in the network.

Social Network

In this section, "social network" is used not to refer to the platform itself, but to the actual network of relationships between users (expressed by "following" relationships). Some users may be highly connected (many followers), and students may, for example, as part of their strategy (whether for ads or opinion influence) try to target these opinion leaders.


It will likely be too computationally taxing to display all activity on the social network, but students will have access to various views that provide summaries (e.g. mean sentiment towards some topic or number of users talking about a topic). Ideally an API can be provided like a real social network, so that students can build their own visualizations as part of their bot development process, but this may be limited by the size of the simulation.
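A summary view like the ones mentioned above could be as simple as a per-topic aggregation. The message fields used here (user, topic, sentiment) are assumptions about what the simulation might record, not a defined schema.

```python
from collections import defaultdict
from statistics import mean


def summarize(messages):
    """Per-topic mean sentiment and number of distinct users."""
    sentiments = defaultdict(list)
    users = defaultdict(set)
    for message in messages:
        sentiments[message["topic"]].append(message["sentiment"])
        users[message["topic"]].add(message["user"])
    return {
        topic: {"mean_sentiment": mean(values), "n_users": len(users[topic])}
        for topic, values in sentiments.items()
    }


# Made-up messages for illustration.
messages = [
    {"user": "u1", "topic": "election", "sentiment": -0.5},
    {"user": "u2", "topic": "election", "sentiment": 0.5},
    {"user": "u1", "topic": "election", "sentiment": -0.75},
    {"user": "u3", "topic": "cars", "sentiment": 0.25},
]
summary = summarize(messages)
print(summary)
```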


A simple API will be provided for students to develop their own bots that interact with this network. These bots can follow, be followed, message, etc like agents can and will be the primary way students interact with the social network.

What distinguishes bots from simulated agents is that bots are designed and controlled by students, whereas the simulated agents represent "real" users of the network.

Learning Objectives

The network functions as a simplified social landscape for students to understand how ads, bots, and news feed algorithms affect opinion, trends, and discussion on a social network, and how that links up with broader spheres of discourse outside of the network. Some students may, for example, design bots that influence opinion in a certain direction, while others may design bots to influence opinion in a different direction, while still others may design bots that root out these interfering bots. Depending on how the network is designed, some students can be the managers of the social network platform.

The goal is for students to develop a comprehensive mental model about the dynamics of social media and communication in the internet age, to peek "behind the curtain" and develop a critical perspective when using social media and reading the news (i.e. develop social media literacy).


In theory this simulated social network can be extended with features that could be present on any social network, such as anonymous accounts, different kinds of blocking and muting functionality, and so on. Thus it can also be a place where students can experiment with new features to see how that affects dynamics on the network.


Ideally the simulator accommodates students who are comfortable with programming and those who aren't.

For students who aren't, bot templates could be provided that require little to no programming experience, or another layer could be developed where they "purchase" bot, marketing, and other services that run automatically.

If there are multiple classes going on, they can all work from the same simulation and take on different roles. If one class is focused on advertising, they can take on roles of the advertising ecosystem, while in another class perhaps they collectively take on the role of the platform. The potential for cross-class interactivity is exciting.

7x7 Cutting Room Floor

I was fortunate enough to participate in this year's edition of Rhizome's Seven on Seven with Sean Raspet. The event pairs an artist and a technologist and gives some limited time for the pair to come up with and implement a concept or project. In previous editions pairs only had a day or so; this time we had about a month.

It was enough time to churn through several ideas that never made it to the final presentation. We landed on producing a white paper proposing leveraging blockchain-based distributed computing to collectively simulate a complete human cell at the atomic level, starting with something relatively simple like a red blood cell. A human cell might have hundreds of trillions of atoms and so simulating one at the atomic resolution is basically infeasible with existing computational resources. But it is more feasible now than it was maybe a decade ago.

Through our research we came across some staggering statistics about the computing power of the Bitcoin network, namely that it is estimated to have an aggregate computing power of 80.7 zettaFLOPS (80.7 million petaFLOPS) as of May 2018. The world's reigning supercomputer, the Sunway TaihuLight, has a theoretical peak of 125 petaFLOPS. The Folding@Home network, which enables people to donate spare computing power for protein folding simulations, had an aggregate power of about 100 petaFLOPS in January 2018. Not bad for a volunteer distributed network, but still far off from the Bitcoin network. There are more details in the white paper, but those numbers stuck out.
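Taking the figures quoted above at face value, the gap is easy to put a number on. The snippet only does unit conversion and division; it uses integers to avoid floating-point noise.

```python
PETA = 10**15
ZETTA = 10**21

bitcoin = 807 * ZETTA // 10   # 80.7 zettaFLOPS
taihulight = 125 * PETA       # theoretical peak of the Sunway TaihuLight
folding_at_home = 100 * PETA  # aggregate power in January 2018

print(bitcoin // PETA)             # 80700000 petaFLOPS, i.e. 80.7 million
print(bitcoin // taihulight)       # 645600
print(bitcoin // folding_at_home)  # 807000
```

So by these estimates the Bitcoin network is roughly 650,000 times the supercomputer's peak and roughly 800,000 times Folding@Home.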

Anyways, we went through a few ideas before we landed on this white paper. Our first focus was on the phosphorus commodity market in relationship to "peak phosphorus". This was something Sean had been researching for some time now, and for the past few months I've been poking around the agri-tech scene, so I was naturally drawn to it as a topic. The gist is that phosphorus is a mineral crucial to agriculture, a key component in fertilizers (along with nitrogen and potassium; the history of nitrogen fertilizer is a very interesting and troubling one), and is basically a non-renewable resource (some can be recovered from waste but I'm not sure what percentage of it is recoverable). At some point in the relatively near future phosphorus extraction may become too expensive or difficult and that could lead to some serious food security crises. So we were thinking of various ways to represent this issue. Here are a few ideas we played around with.

Global phosphorus simulation

Phosphorus, like any resource-extractive industry, is global. We wanted to be able to convey geopolitical issues like Morocco's occupation of Western Sahara, which is where Morocco mines its phosphorus. The most straightforward way to do something like that is a 4X-style global simulation, so we played around with that first.

I designed a little framework for laying out a hex-based map (similar to the cartog library I created for my Simulation & Cybernetics class, but I wanted to support 3D):

3D hex-based maps

That's about as far as we got in terms of implementation. But the general idea was that we'd model the dynamics of the global phosphorus market, with some shocks and random events, and projections of changes in relevant indicators like growth rates, meat consumption rates, and so on. And somehow you'd see these effects on this map and through changes in the price of commodity phosphorus.

I didn't want this hex map to be the only "output" of the simulation. We wanted to show that the macro-level dynamics of the phosphorus market are intimately connected to the health of individual plants, and so I wanted to set up an automated growing system as a more material visualization. The system would be hydroponic or aeroponic, with a phosphorus nutrient pump that releases more or less phosphorus depending on its simulated price. As peak phosphorus approaches, the plant's health starts to deteriorate as it manifests symptoms of phosphorus deficiency. There were a few issues here, namely that 1) it's a pretty big task to set up such a growing system, and 2) the changes in the plant's health would happen over long time scales relative to the simulation (e.g. one simulation year might run in one real minute, and the impacts on the plant's health might not be visible for a few real days).

Phosphorus deficiency in corn

A build on this idea we considered is that we'd reserve some set amount of funds for the plant, and it would actually have to "purchase" phosphorus from the nutrient reservoir on its own.

Commodity traders vs food consumers

For a while I've wanted to make an asymmetric game which consists of two separate games that are at first glance unrelated. For example, on one side of the room is a relatively innocuous-looking life simulator game where you have to e.g. buy a house and care for your family. On the other side of the room is a stock market game where you just try to earn the highest return on your investments. What isn't apparent at first is that the actions of the player in the stock market game directly affect how difficult the life-simulator game is, for example, by triggering financial crises or affecting house prices.

We briefly considered doing something along these lines. The idea was that when we presented, we'd direct audience members to a website where they could join our phosphorus game. Some audience members would be redirected to the "commodity trader" version of the game, while others would instead be redirected to the "food consumer" version.

The commodity trader game is basically the same as the stock market game, except just for phosphorus trading.

Commodity trader interface

The food consumer game is built around a "basket", like a simplified version of a consumer price index focused on products especially affected by phosphorus prices. As a player you'd have some nutritional requirements to meet or some other purchasing obligations and some weekly budget with which to buy food. We didn't really get far enough to thoroughly think through the mechanics.

Food consumer interface

I did have a really fun time modeling the food:

Food models

Plant care Tamagotchi

Riffing off the plant-as-visualization idea, we also toyed around with the idea of some kind of plant-tamagotchi. You'd have to manage its water and phosphorus needs by doing some sort of trading or other gameplay. I can't really remember how far we got with the design.

Plant care

I did enjoy making this wilting animation though:

Plant wilting

Physics-based food thing

I honestly can't remember what the concept was for this. The most I can recall is that we discussed a system where you could rapidly click on some food or raw material objects to create derivative objects (such as beef and milk from a cow) and that somehow we'd connect that to the relative use of phosphorus in these products. For example, a cow requires a lot of feed which requires a lot of phosphorus, which results in a less efficient phosphorus-to-calorie ratio than if you had just eaten the feed grains yourself. I think I was really just excited about making something physics-based.

Picking up objects
Tapping on objects for derivatives
Bouncing around

I'll definitely use this again for a different project.

The Infinite Card Game

  • 02.28.2018
  • etc

Kira and I were in Australia most of last month, and near where we were staying in Melbourne was a game shop. We had a free Friday night so I stopped by for my first Magic: The Gathering (MTG) draft event, and it got me thinking about designing card game systems.

MTG is a collectible card game with a great deal of strategic depth. Games with large state spaces like Chess and more recently Go have been more-or-less "solved" [1]. The state space of MTG is certainly orders of magnitude larger than Chess and Go, given the massive back catalog of cards (going back to 1993!) [2] and the ever-growing number of interactions between them. Though the state space of Starcraft is likely larger (and people are working on "solving" it), to my knowledge MTG has not yet been solved in this sense.

For those unfamiliar with MTG, it's played between two or more players and involves constructing a deck of cards around a particular strategy. Some strategies may emphasize fast, aggressive plays ("aggro") which, if failing to win quickly, lose steam in longer matches. Others may focus on slowing opponents down by stopping plays short or making actions more expensive ("control"). And there are other strategies still.

MTG has a variety of game formats which govern how decks are constructed and can affect other game rules. These formats are broadly divided into Constructed and Limited formats. Constructed formats are where players carefully design and assemble their decks in advance. This gives plenty of space for creative, expressive strategies since players have a large pool of cards to select from. In contrast, Limited formats mean that players are given a small number of random cards drawn from a set of cards and need to assemble a deck on-the-spot (a process called "drafting").

I've mostly played Constructed formats, but now that I've tried Limited a bit more I'm coming to prefer the randomness and uncertainty of Limited formats. In Limited you have to think on your feet more, design your deck more delicately (you aren't sure what to expect from your opponents), and work within a tighter set of constraints. It makes for more challenging and exciting games.

The problem with Limited is that each set has roughly 200-300 cards. After a few games you'll be familiar with every card, and players will have learned the strategies that work best within that set. Games start to get formulaic and stale. The format loses that sense of uncertainty that makes Limited exciting in the first place. It isn't until the next set is released, with new cards and abilities, that things are interesting again.

These sets are carefully designed such that the cards have enough variation to keep things interesting, but not so much that they're totally incoherent (Mark Rosewater, the lead designer of MTG, has a great podcast delving into this design process). And they are meticulously balanced so no strategy is strictly better than any others.

That being said...I wonder if there's a way to design an infinite set, i.e. a dynamic and self-adjusting process which outputs a stream of cards to draw from for a never-ending Limited format. Such a system would need some rule scaffolding or framework (doesn't have to be MTG's) from which it can derive new mechanics and costs (some quantifier of their power), and then generate a balance of cards over some probability distribution.

For example, a core mechanic in MTG is that you have creatures that can attack opponents to damage them. Players can use their own creatures to "block" opponents' attacking creatures. These creatures have some cost to play ("cast") them; generally stronger creatures have the drawback of costing more to cast. Sometimes they may have other abilities which make them more or less versatile, which is compensated by a respective increase or decrease in casting cost. Sometimes you have creatures which are disproportionately cheap in casting cost for their strength, but these are rare.

Let's say the game for this infinite draft system has just this simple attacking-creature mechanic. Our creatures have only a strength and a casting cost. Generally, the stronger the creature, the higher its casting cost. But not always -- on rare occasions we might have a strong creature that's a bit cheaper than normal. Finally, we add the additional constraint that weaker creatures are more likely, so we emphasize strong creatures as a more notable event.

What we're essentially saying is that the casting cost of a creature is dependent on its strength (and vice versa), which we can represent as a simple Bayes net:

[Figure: Bayes net with a single edge, strength -> casting_cost]

When we want to create a new card, we can first sample its strength, then sample a casting cost depending on the value we sampled for its strength.

We want a creature's strength to be a positive integer, say in the range $[1, 12]$. So we want a discrete probability distribution with finite support. We could use a Beta-binomial distribution, e.g. $\text{BetaBinomial}(\alpha=2, \beta=6, n=12)$, which has the properties we want:

$\text{BetaBinomial}(\alpha=2, \beta=6, n=12)$

Here creatures will tend to have a strength somewhere in $[1, 4]$ and very rarely above 6. Then we can do something similar with casting cost, except that it's dependent on the strength.
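Here is a minimal sketch of that two-step sampling using only the standard library. The Beta-binomial draw follows its usual construction (draw p from a Beta, then count successes in n trials). Two things are assumptions rather than anything decided above: since that distribution can return 0, the sketch redraws until the strength is at least 1, and the conditional for casting cost (strength plus a small wobble, with a rare discount) is just a placeholder.

```python
import random


def sample_beta_binomial(rng, alpha, beta, n):
    """Draw p ~ Beta(alpha, beta), then count successes in n Bernoulli(p) trials."""
    p = rng.betavariate(alpha, beta)
    return sum(rng.random() < p for _ in range(n))


def sample_strength(rng):
    # BetaBinomial(2, 6, 12) can return 0; redraw until the value is in [1, 12]
    # (an assumption, to keep strength positive).
    while True:
        strength = sample_beta_binomial(rng, 2, 6, 12)
        if strength >= 1:
            return strength


def sample_casting_cost(rng, strength):
    # Placeholder conditional: cost tracks strength with a small wobble,
    # plus a rare 2-point discount for the occasional bargain creature.
    wobble = rng.choice([-1, 0, 0, 1])
    discount = 2 if rng.random() < 0.05 else 0
    return max(1, strength + wobble - discount)


def sample_creature(rng):
    strength = sample_strength(rng)
    return {"strength": strength, "casting_cost": sample_casting_cost(rng, strength)}


rng = random.Random(42)  # seeded so a run can be repeated
creatures = [sample_creature(rng) for _ in range(5000)]
print(creatures[:3])
```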

This is an extremely simple game and so not a very interesting one. We'd want to add in additional abilities that interact in interesting ways. For example, in MTG the "flying" ability makes a creature blockable only by other creatures with flying. So we can add in some small probability of a creature gaining flying, and have that also affect the casting cost's distribution.

MTG's Flying mechanic

A really nice version of this system is one where you can pass in an arbitrary network relating costs and abilities (a more complex example of the one above), and it would output card descriptions in some interchange form (e.g. JSON), and you can use that to print cards with whatever design you wanted.
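A toy version of that pipeline might look like the following. The ability table stands in for the "arbitrary network": each ability has a chance of appearing and a surcharge it adds to the casting cost. The field names, the numbers, and the linearly decreasing strength weights are all invented for the example.

```python
import json
import random

# Hypothetical ability table: chance of appearing and casting cost surcharge.
ABILITIES = {
    "flying": {"probability": 0.15, "extra_cost": 1},
}


def generate_card(rng, abilities=ABILITIES):
    """Generate one creature card as a JSON-serializable dict."""
    strengths = list(range(1, 13))
    weights = [13 - s for s in strengths]  # weaker creatures are more likely
    strength = rng.choices(strengths, weights=weights)[0]
    gained = sorted(
        name for name, spec in abilities.items()
        if rng.random() < spec["probability"]
    )
    cost = strength + sum(abilities[name]["extra_cost"] for name in gained)
    return {
        "type": "creature",
        "strength": strength,
        "abilities": gained,
        "casting_cost": cost,
    }


rng = random.Random(7)  # seeded so a run can be repeated
cards = [generate_card(rng) for _ in range(200)]
print(json.dumps(cards[0], indent=2))
```

Anything that reads JSON (a card template, a print layout script) could then consume the output without knowing how the cards were generated.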

A few years ago I prototyped a similar system of cost-based card generation, which incorporated different card types beyond creatures (which I called "units"), additional abilities, and procedurally-generated names.

An example generated card

This prototype doesn't incorporate intra-card balance beyond what falls out of cost-balanced cards. The relative effectiveness of various abilities is really hard to objectively quantify, since their costs are really relative to what the dominant strategies are, i.e. the metagame. So ideally this infinite draft system not only generates balanced cards but also tracks how people are playing them to learn which strategies seem over or underpowered, and correspondingly tweaks the costs of abilities related to those strategies.

The generation features I just described are more about balance but another interesting feature would be introducing a balance-drift so that gameplay never stagnates in an equilibrium of strategies. Perhaps once balance is achieved the system can gradually and temporarily bias the game towards different strategies to encourage different kinds of gameplay. That way there'd be an ebb-and-flow that keeps things interesting in the aggregate and subtly changes the overall feeling of the game.

For example, if the system sees that players almost exclusively play low-strength, cheap creatures, and almost no larger creatures, maybe it will start to slightly cheapen the larger creatures so they see more play. That in turn may cause a new strategy to dominate, maybe a slower gameplay with larger creatures, and eventually a different change would be introduced to shock the system to a different strategy equilibrium.
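The feedback step in that example could be sketched as a simple proportional adjustment: compare how often each kind of card gets played against an even split, and nudge costs up for the over-played and down for the under-played. The card categories, the step size, and the idea of an even split as the target are all assumptions for illustration.

```python
def adjust_costs(costs, play_share, step=1.0, min_cost=1.0):
    """Nudge each cost by how far its play share is from an even split."""
    target = 1.0 / len(costs)
    return {
        name: max(min_cost, cost + step * (play_share.get(name, 0.0) - target))
        for name, cost in costs.items()
    }


# Made-up observation: players almost exclusively play small creatures.
costs = {"small_creature": 2.0, "large_creature": 6.0}
play_share = {"small_creature": 0.9, "large_creature": 0.1}

new_costs = adjust_costs(costs, play_share)
print(new_costs)  # small creatures get pricier, large ones cheaper
```

A deliberate balance-drift could reuse the same function with a target that is slowly moved away from the even split.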

I've given a very hand-wavy outline of this system here and as described it by no means would match MTG's hand-designed depth and complexity. But I do like the idea of a general system where you input some mechanics and it outputs a series of cards to play a game with. You could, perhaps, model many different systems via a card game format this way.

  1. Not solved in the proper sense, but human players are reliably bested by computer players. 

  2. Based on's data, there are 18,191 unique cards as of Jan 21, 2018. 

Public transit routing

For the transit demand model for the PolicySpace project (see previously) we needed to be able to ingest GTFS (General Transit Feed Specification) data and use it to generate public transit routes. That is, given a start and an end stop, what's the quickest route using only public transit? Ideally we minimize both travel time and number of transfers, or strike some balance between the two.

I found an existing implementation here, based off of the papers Trip-Based Public Transit Routing (Sascha Witt, 2015) and Trip-Based Public Transit Routing Using Condensed Search Trees (Sascha Witt, 2016). The routing worked fine, but processing the Belo Horizonte GTFS data (630 routes with 9,428 stops, including buses and metro) took over 12 hours. The routing itself did not seem to have a significant speed gain from this pre-processing (unfortunately I don't have specific numbers on hand).

Instead I spent the past few days implementing my own public transit routing algorithm. It's a first attempt, and so it's naive/not heavily optimized, but so far seems to work alright. The goals were to significantly reduce pre-processing time, have a relatively low memory footprint, and quickly produce routes.

First, it's worth explaining a bit about how GTFS data is structured and the terminology it uses.

Public transit has a few components:

  • stops: same as the colloquial usage; a 🚏stop is somewhere a bus, train, etc stops to pick up and drop off, described with an id and a latitude and longitude.
  • trips: a 🚌trip is a sequence of stops associated with a set of arrival and departure times. For example, trip 🚌A goes through stop 🚏X at 12:00, then stop 🚏Y at 13:00, then stop 🚏Z at 14:00. Trip 🚌B goes through stop 🚏X at 13:00, then stop 🚏Y at 14:00, then stop 🚏Z at 15:00. Even though they have the same stop sequence (🚏X->🚏Y->🚏Z), they arrive/depart from each stop at different times, so they are distinct trips.
  • route: a route is a collection of trips. The trips in a route do not necessarily share the exact same sequence of stops. For example, a route may have a trip that goes 🚏X->🚏Y->🚏Z and another trip that only goes 🚏X->🚏Y. (I didn't use this data and it's confusing because we're using the term "route" to describe something different. This is the last you'll see this particular usage in this post.)
  • service: a service associates trips with days of operation (e.g. weekdays, weekends, off or alternative schedules on holidays, etc).

With these terms defined, we can define a public transit "route" more specifically as "a sequence of trips and their transfer stops". This is a bit more specific than how we describe routes conversationally, which is more like "take the Q line and transfer to the M at Foo Station". Instead, what we're after is more like "take the 12:42 train on the Q line, transfer to the 13:16 M at Foo Station".

GTFS data includes a few files (which are essentially CSVs but for whatever reason use the txt extension); the important ones here are:

  • stop_times.txt: describes stop sequences for trips, as well as arrival and departure times for each trip's stops.
  • stops.txt: describes stop latitudes and longitudes

We use some of the other data files as well, but those are more for juggling stop and trip ids and less core to the routing algorithm.
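
To make the file layout concrete, here is a minimal sketch of reading stop_times.txt with Python's csv module and grouping rows by trip. The sample rows and the helper names (to_seconds, load_stop_times) are made up for illustration and are not taken from my implementation; real feeds carry more columns than shown.

```python
import csv
import io
from collections import defaultdict

# Tiny made-up sample in the stop_times.txt layout.
SAMPLE = """trip_id,arrival_time,departure_time,stop_id,stop_sequence
A,12:00:00,12:01:00,X,1
A,13:00:00,13:01:00,Y,2
B,13:00:00,13:01:00,X,1
B,14:00:00,14:01:00,Y,2
"""

def to_seconds(hms):
    """Convert HH:MM:SS to seconds. GTFS allows hours >= 24 for trips past midnight."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds

def load_stop_times(fileobj):
    """Group stop_times rows by trip_id, ordered by stop_sequence."""
    trips = defaultdict(list)
    for row in csv.DictReader(fileobj):
        trips[row["trip_id"]].append({
            "stop_id": row["stop_id"],
            "seq": int(row["stop_sequence"]),
            "arr": to_seconds(row["arrival_time"]),
            "dep": to_seconds(row["departure_time"]),
        })
    for stops in trips.values():
        stops.sort(key=lambda s: s["seq"])
    return dict(trips)

trips = load_stop_times(io.StringIO(SAMPLE))
```

Working in seconds rather than clock strings keeps the later edge-weight arithmetic simple.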

Transfer network: structure

The general approach is to generate a transfer network which describes the stops at which various trips are linked together (i.e. where transfers can be made). The nodes of this network are described as (trip_id, stop_id) tuples. Note that not all stops of a trip will be nodes in the network; only those where transfers are made. Our resulting route only needs to return trips and where to transfer, not the individual stops within trips. That's something we can easily look up later using the stop_times.txt data if needed.

The edges of this network can describe two different things:

  • inter-trip edges: If an edge connects two nodes which share a trip_id, e.g. (🚌A, 🚏X)->(🚌A, 🚏Y), it indicates in the network structure that these nodes are along the same trip. Here the edge weight indicates travel time between stops 🚏X and 🚏Y for trip 🚌A.
  • transfer edges: If an edge connects two nodes which don't share a trip_id, e.g. (🚌A, 🚏X)->(🚌B, 🚏X) it indicates a transfer between trips, e.g. transfer between trips 🚌A and 🚌B at stop 🚏X, and the edge weight indicates estimated transfer time.

We also need to distinguish between direct and indirect transfers:

  • a direct transfer is one where the two connected trips share a stop. For example, (🚌A, 🚏X)->(🚌B, 🚏X) is a direct transfer because these trips both share the stop 🚏X.
  • an indirect transfer is one where the two connected trips do not share a stop, but there is a reasonable footpath between the stops (i.e. you could walk between stops under some threshold transfer time limit, e.g. 5 minutes). For example (🚌A, 🚏X)->(🚌B, 🚏Y) where stop 🚏Y is around the corner from stop 🚏X. By introducing these additional edges, we potentially find shorter routes or routes where there normally wouldn't be one.

The transfer network is a directed graph. This is due to the forward arrow of time. For example, trip 🚌A arrives at stop 🚏X at 12:10 and trip 🚌B departs from stop 🚏X at 12:20, which indicates a valid transfer from trip 🚌A to 🚌B. However, I can't make the transfer in reverse; if I'm on trip 🚌B and arrive at 🚏X, trip 🚌A has potentially already departed from 🚏X. If trip 🚌A lingers at 🚏X long enough that the transfer also works in the reverse direction, we'd just have another edge (🚌B, 🚏X)->(🚌A, 🚏X).
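
As a rough illustration of this structure, the sketch below stores the network as a directed adjacency map keyed by (trip_id, stop_id) tuples and classifies an edge from the ids alone. The class name TransferNetwork, the kind helper, and the weights are assumptions for the example, not the names or data used in my code.

```python
from collections import defaultdict

class TransferNetwork:
    """Directed, weighted graph whose nodes are (trip_id, stop_id) tuples."""

    def __init__(self):
        # node -> {neighbouring node: edge weight in seconds}
        self.edges = defaultdict(dict)

    def add_edge(self, src, dst, seconds):
        # Only src -> dst is stored; the reverse edge must be added explicitly.
        self.edges[src][dst] = seconds

    @staticmethod
    def kind(src, dst):
        """Classify an edge using only the ids in its two nodes."""
        if src[0] == dst[0]:
            return "inter-trip"         # same trip_id
        if src[1] == dst[1]:
            return "direct transfer"    # different trips, shared stop
        return "indirect transfer"      # different trips, walk between stops

net = TransferNetwork()
net.add_edge(("A", "X"), ("A", "Y"), 3600)  # ride trip A from X to Y
net.add_edge(("A", "X"), ("B", "X"), 600)   # A arrives 12:10, B departs 12:20
net.add_edge(("A", "X"), ("B", "Y"), 900)   # made-up walk plus wait
```

Because edges are stored one way only, the transfer from 🚌B back to 🚌A at 🚏X simply does not exist unless it is added.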

Transfer network: generation

The pre-processing step is almost entirely the generation of this transfer network. Once we have this network it's relatively easy to generate routes with Dijkstra's algorithm.

The generation process is broken into three parts:

  1. Generate direct transfer edges
  2. Generate indirect transfer edges
  3. Generate inter-trip edges

Direct transfer edge generation

We can use the stop_times.txt file to identify direct transfers. We group rows according to stop_id and look at the trip_ids under each group. However, we can't just create edges between all these trips here; we have to filter in two ways:

  1. down to valid transfers (those that obey the arrow of time). As noted before, a trip 🚌A can only transfer to a trip 🚌B at stop 🚏X if trip 🚌A arrives at 🚏X before trip 🚌B departs.
  2. among each set of equivalent trips, keep only the one that departs soonest after the incoming trip arrives.

Expanding on 2): we assume that travellers want to make the soonest transfer possible from among a set of equivalent trips. Equivalent trips are those that share the same stop sequence, regardless of specific arrival/departure times. For example: a trip 🚌A with stops 🚏X->🚏Y->🚏Z and a trip 🚌B with stops 🚏X->🚏Y->🚏Z share the same stop sequence, and so are equivalent.

Say we have a trip 🚌C that also stops at stop 🚏X, arriving at 12:10. Trip 🚌A departs from 🚏X at 12:20 and trip 🚌B departs from 🚏X at 12:40. We only create the edge (🚌C, 🚏X)->(🚌A, 🚏X) and not (🚌C, 🚏X)->(🚌B, 🚏X) because we assume no person will wait 20 extra minutes to take 🚌B when 🚌A goes along the exact same stops and departs sooner.

This assumption greatly reduces the number of edges in the network. With the Belo Horizonte data, this assumption cut edges by about 80%, roughly 18.5 million fewer edges!
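
The two filters can be sketched as a single pass over the candidate trips at a stop. This is a simplified illustration rather than the code from my implementation: the function name and the (trip_id, departure, stop_sequence) tuple layout are assumptions, times are in seconds since midnight, and the caller is expected to leave the incoming trip out of the candidates.

```python
def soonest_equivalent_transfers(arrival, candidates):
    """Pick valid transfers for a trip arriving at a stop at `arrival`.

    candidates: iterable of (trip_id, departure_seconds, stop_sequence),
    where stop_sequence is a tuple of stop ids.
    Returns {trip_id: wait_seconds}, one trip per distinct stop sequence.
    """
    best = {}  # stop_sequence -> (departure, trip_id)
    for trip_id, departure, sequence in candidates:
        if departure <= arrival:
            continue  # filter 1: must depart after we arrive
        if sequence not in best or departure < best[sequence][0]:
            best[sequence] = (departure, trip_id)  # filter 2: soonest equivalent
    return {trip_id: departure - arrival for departure, trip_id in best.values()}

# Trip C arrives at stop X at 12:10 (43800 seconds).
candidates = [
    ("A", 44400, ("X", "Y", "Z")),  # departs 12:20
    ("B", 45600, ("X", "Y", "Z")),  # departs 12:40, equivalent to A
    ("D", 43500, ("X", "V")),       # departs 12:05, already gone
    ("E", 45000, ("X", "W")),       # departs 12:30, different sequence
]
transfers = soonest_equivalent_transfers(43800, candidates)
```

Here 🚌B is dropped in favour of 🚌A, and a trip that has already left is never considered.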

Indirect transfer edge generation

To generate indirect transfers, we use the following process:

  1. generate a spatial index (k-d tree) of all stops
  2. for each stop 🚏X, find the n closest stops (neighbors)
    1. for each neighbor 🚏Y:
      1. compute estimated walking time between 🚏X and 🚏Y
      2. create edges between trips going through 🚏X and 🚏Y, sticking to the same constraints laid out for direct transfers (valid and soonest equivalent transfers)

The network edge count and processing time vary with n. Increasing n will, of course, increase edge count and processing time. I've been using n=3 but this will likely vary depending on the specifics of the public transit network you're looking at.
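
Steps 1 and 2 and the walking-time estimate might look roughly like the sketch below. To keep it dependency-free it uses a brute-force nearest-neighbor search in place of the k-d tree, which is fine for a handful of stops but not for thousands. The walking speed, the straight-line (haversine) distance, and the reuse of the 5-minute threshold are all assumptions for illustration, as are the function names and coordinates.

```python
import math

WALK_SPEED_M_PER_S = 1.4    # assumed average walking speed
MAX_WALK_SECONDS = 5 * 60   # assumed threshold, matching the 5-minute example

def haversine_m(a, b):
    """Great-circle distance in metres between two (lat, lon) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371000 * math.asin(math.sqrt(h))

def walkable_neighbors(stops, n=3):
    """stops: {stop_id: (lat, lon)}.

    Returns {stop_id: [(neighbor_id, walk_seconds), ...]} for the n closest
    stops that are within the walking-time threshold.
    """
    result = {}
    for stop_id, position in stops.items():
        # Brute force: measure every other stop, then keep the n closest.
        distances = sorted(
            (haversine_m(position, other), other_id)
            for other_id, other in stops.items()
            if other_id != stop_id
        )
        result[stop_id] = [
            (other_id, meters / WALK_SPEED_M_PER_S)
            for meters, other_id in distances[:n]
            if meters / WALK_SPEED_M_PER_S <= MAX_WALK_SECONDS
        ]
    return result

# Made-up coordinates: Y is around the corner from X, Z is kilometres away.
stops = {
    "X": (-19.9200, -43.9400),
    "Y": (-19.9205, -43.9400),
    "Z": (-19.9700, -43.9400),
}
neighbors = walkable_neighbors(stops, n=3)
```

Each (stop, neighbor) pair that survives would then go through the same valid and soonest-equivalent filtering as direct transfers, with the walking time added to the wait.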

Generate inter-trip edges

Finally, we generate edges between nodes that share a trip. We link these nodes in their stop sequence order. For example, for a trip 🚌A with stop sequence 🚏X->🚏Y->🚏Z, where 🚏X and 🚏Z are both transfer stops, we should have an edge (🚌A,🚏X)->(🚌A,🚏Z). (🚌A,🚏Y) isn't in the graph because it isn't a transfer node.
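
A small sketch of this linking step, assuming each trip is a list of (stop_id, arrival, departure) tuples in stop-sequence order with times in seconds. The function name and the choice to measure travel time from departure at one node to arrival at the next are assumptions for the example.

```python
def inter_trip_edges(trip_id, trip_stops, transfer_stop_ids):
    """Link consecutive transfer nodes along one trip.

    trip_stops: list of (stop_id, arrival_seconds, departure_seconds),
    already in stop-sequence order.
    transfer_stop_ids: stops of this trip that are nodes in the network.
    Returns a list of (src_node, dst_node, travel_seconds).
    """
    # Keep only the stops that are transfer nodes, preserving their order.
    nodes = [stop for stop in trip_stops if stop[0] in transfer_stop_ids]
    edges = []
    for (src, _, departure), (dst, arrival, _) in zip(nodes, nodes[1:]):
        edges.append(((trip_id, src), (trip_id, dst), arrival - departure))
    return edges

trip_a = [
    ("X", 43200, 43200),  # 12:00
    ("Y", 46800, 46800),  # 13:00, not a transfer stop
    ("Z", 50400, 50400),  # 14:00
]
edges = inter_trip_edges("A", trip_a, {"X", "Z"})
```

The intermediate stop 🚏Y is skipped, so the single edge carries the full two hours of travel time.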

Trip routing

Once the transfer network is ready we can leverage Dijkstra's algorithm to generate the trip routes. However, we want to accept arbitrary start and end stops, whereas the nodes in our network only represent transfer stops, so not all stops are included. We use the following procedure to accommodate this:

  1. Given a starting stop 🚏s and departure time 🕒dt, find all trips 🚌T_s that go through 🚏s after 🕒dt.
  2. Given an ending stop 🚏e, find all trips 🚌T_e that go through 🚏e after 🕒dt.
  3. If there are trips in both 🚌T_s and 🚌T_e, that means there are trips from start to finish without any transfers. Return those and finish. Otherwise, continue.
  4. For each potential starting trip 🚌t_s in 🚌T_s, find the transfer stop soonest after 🚏s in 🚌t_s (note that this could be 🚏s itself) to be an entry node into the transfer network. We'll denote these nodes as ⭕N_s. For example, if we have trip 🚌A with stop sequence 🚏X->🚏Y->🚏Z and 🚏X is our starting stop, but not a transfer stop (meaning it's not in our transfer network), and 🚏Y is a transfer stop, we get the node (🚌A, 🚏Y). If 🚏X is a transfer stop, then we just get (🚌A, 🚏X).
  5. For each potential ending trip 🚌t_e in 🚌T_e, find the transfer stop closest before 🚏e in 🚌t_e (which could be 🚏e itself). We'll denote these nodes as ⭕N_e. This is just like step 4 except we go in reverse. For example, if we have trip 🚌A with the stop sequence 🚏X->🚏Y->🚏Z and 🚏Z is our ending stop but not a transfer stop, and 🚏Y is a transfer stop, then we get the node (🚌A, 🚏Y).
  6. Create a dummy start node, ⭕START, and create edges from it to each node in ⭕N_s, where the edge weight is the travel time from stop 🚏s to that node.
  7. Create a dummy end node, ⭕END, and create edges from each node in ⭕N_e to it, where the edge weight is the travel time from the node to stop 🚏e.
  8. Use Dijkstra's algorithm to find the shortest weighted path from ⭕START to ⭕END. This path is our trip route.
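
Steps 6 through 8 can be sketched with a plain heap-based Dijkstra over an adjacency map. This is an illustrative version, not the code from my implementation; the function name, the tiny network, and its weights are made up. The counter is only there as a tie-breaker so the heap never has to compare a tuple node with the string dummy nodes.

```python
import heapq
from itertools import count

def shortest_path(edges, start, end):
    """Dijkstra over edges = {node: {neighbor: weight}}.

    Returns (total_weight, path), or (inf, []) if end is unreachable.
    """
    tie = count()  # tie-breaker so nodes themselves are never compared
    queue = [(0, next(tie), start, None)]
    previous = {}  # settled node -> the node we reached it from
    while queue:
        dist, _, node, parent = heapq.heappop(queue)
        if node in previous:
            continue  # already settled via a shorter path
        previous[node] = parent
        if node == end:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return dist, path[::-1]
        for neighbor, weight in edges.get(node, {}).items():
            if neighbor not in previous:
                heapq.heappush(queue, (dist + weight, next(tie), neighbor, node))
    return float("inf"), []

# Made-up transfer network, weights in seconds.
network = {
    ("A", "Y"): {("B", "Y"): 300},
    ("B", "Y"): {("B", "W"): 900},
    ("C", "Y"): {("B", "W"): 3000},
}
# Step 6: dummy start node linked to the entry nodes.
network["START"] = {("A", "Y"): 600, ("C", "Y"): 100}
# Step 7: exit node linked to the dummy end node.
network[("B", "W")] = {"END": 240}

cost, path = shortest_path(network, "START", "END")
```

Dropping the two dummy nodes from the returned path leaves the sequence of (trip, transfer stop) pairs that makes up the route.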


On the Belo Horizonte data (630 routes with 9,428 stops) with n=3 for indirect transfer generation we get a graph of 239,667 nodes and 4,879,935 edges. I haven't yet figured out a way to further reduce this edge count.

The transfer network generation takes about 20 minutes on my ThinkPad X260. I'm not certain a direct comparison is appropriate, but this is significantly faster than the 12 hours the other implementation took. The actual route generation doesn't seem any faster or slower than the other implementation.

While the routing is fairly fast, it definitely could be faster, and is not fast enough for direct usage in PolicySpace. We won't be able to compute individual trip routes for every one of Brazil's ~209 million inhabitants, but we were never expecting to either. This trip routing feature will be used by some higher-level interface which will cache routes and batch compute more approximate routes between regions (called "transportation analysis zones" in the transit demand modeling literature) to cut down on processing time. We're still figuring out specifics.

Anyway, this is version 1 of this trip routing method -- I'd appreciate any feedback and suggestions on how to improve the approach. I need to do more testing to get a better sense of the generated routes' quality.

The code for this implementation at time of writing is available here. A couple of names in the code differ from those used here, e.g. the transfer network is called a "trip network", but the algorithm itself should be the same.

ProtonMail Bridge & Mutt

ProtonMail recently released their Linux beta for Bridge, which provides IMAP/SMTP access to the service. Prior to Bridge you could only access the service through the web interface, which is sort of clunky and requires you to, among other things, rely on their search, which is limited by the fact that they can't really index your emails - because you're paying them not to read the message bodies!

ProtonMail provides instructions for setting up the Bridge with common email applications like Thunderbird, but that's about it. So here's how to set it up with NeoMutt and OfflineIMAP for fetching our emails.

(My full email setup also includes the common Mutt companions NotMuch for better searching and urlscan for viewing URLs more easily in emails, in addition to some custom scripts, such as one for viewing HTML emails in a nice popover window and one for viewing MHT/MHTML emails (which are emails that contain inline attachments). It's too much to cover here, but if you want to poke around these scripts and my full email configs (at time of writing), see my dippindots.)

Installing NeoMutt and OfflineIMAP

These instructions are for Ubuntu 16.04, but I imagine they aren't much different for other distributions (yours might even have a package you can install).

Install dependencies:

sudo apt install -y xsltproc libidn11-dev libsasl2-dev libnotmuch-dev --no-install-recommends

Then grab the latest NeoMutt release, extract, and build:

./configure --disable-doc --ssl --sasl --notmuch
sudo make install

# so we can just access it via `mutt`
sudo ln -s /usr/bin/neomutt /usr/bin/mutt

Then install OfflineIMAP:

sudo pip install offlineimap

Running the Bridge

The Bridge can be run from the command line with the Desktop-Bridge program. By default this opens a GUI to set up your accounts, but you can also access a console interface with Desktop-Bridge --cli.

If you aren't already logged in you need to run the login command in this interface.

Configuring OfflineIMAP

First thing to do is configure OfflineIMAP to access our ProtonMail emails.

OfflineIMAP looks for a config at ~/.offlineimaprc. My config at time of writing is:

[general]
accounts = main

[Account main]
localrepository = main-local
remoterepository = main-remote

# full refresh, in min
autorefresh = 0.2

# quick refreshes between each full refresh
quick = 10

# update notmuch index after sync
postsynchook = notmuch new

[Repository main-local]
type = Maildir
localfolders = ~/.mail

# delete remote mails that were deleted locally
sync_deletes = yes

[Repository main-remote]
type = IMAP
remoteport = 1143
remotehost =
remoteuser = <YOUR EMAIL>
remotepass = <YOUR BRIDGE PASSWORD>
keepalive = 60
holdconnectionopen = yes

# delete local mails that were deleted on the remote server
expunge = yes

# sync only these folders
folderfilter = lambda foldername: foldername in ['INBOX', 'Archive', 'Sent']

# SSL is broken with Bridge, but we're connecting locally so it should be ok
ssl = no

Basically this sets up an account arbitrarily called main which will store emails at ~/.mail in the Maildir format. It will only sync the INBOX, Archive, and Sent folders to/from ProtonMail (the folderfilter option). Emails deleted locally will also be deleted on ProtonMail (the sync_deletes option) and emails deleted on ProtonMail will be deleted locally (the expunge option).

After OfflineIMAP fetches new email, it will run the command defined for postsynchook, which in this case is the notmuch command for updating its search index (notmuch new).

Important Bridge-related things to note:

  • Bridge generates a Bridge-specific password for you to use, so use that here and not your actual ProtonMail password.
  • Bridge's IMAP service runs at (normally IMAP runs on port 143 or 993 for SSL)
  • Disable SSL because it was (at least when I set this up) not working with Bridge. But this seems like a non-issue because it's over a local connection anyways and the actual outgoing connection to ProtonMail is encrypted.

Then try running it using the offlineimap command.

Configuring NeoMutt

There is a lot to configure in NeoMutt, so I'll only cover what is necessary to get this setup working. If you're interested in seeing more, my NeoMutt config at time of writing is available here.

NeoMutt looks for a config at ~/.muttrc. To get it working with OfflineIMAP and to send emails with SMTP you need at least:

# "+" substitutes for `folder`
set mbox_type=Maildir
set folder=~/.mail/
set record=+Sent
set postponed=+Drafts
set trash=+Trash
set mail_check=2 # seconds

# smtp
source ~/docs/keys/mail
set smtp_url=smtp://$my_user:$my_pass@
set ssl_force_tls
set ssl_starttls

Where my ~/docs/keys/mail file has contents in the format:

set my_user=<YOUR EMAIL>
set my_pass=<YOUR BRIDGE PASSWORD>

Important Bridge-related notes:

  • The SMTP port is 1025 (typically it's 587)
  • See the previous note on Bridge-specific password

That should be all you need.

"Daemonizing" the Bridge

There currently is no way to daemonize the Bridge, but here's a workaround using tmux:

tmux new-session -d -s mail 'Desktop-Bridge --cli'

This just opens up a new tmux session and runs the Bridge inside of it.