Getting in front of pharma: Automated public discovery of drug candidates

matter.farm
matter.farm

A couple weeks ago Sean and I were fortunate enough to participate in another edition of Rhizome's 7x7, this time in Beijing in collaboration with the Chinese Central Academy of Fine Arts (CAFA). I was very excited to collaborate with Sean again after our first collaboration in New York at the New Museum, and to have the chance to try something different together.

We thought about revisiting our previous project, cell.farm, which was a proposal for a cryptocurrency/distributed computing system for which the proof-of-work protocol involved computing simulation updates for an atomic-level model of a human cell (though our proposal initially suggested simulating a ribosome). Such detailed simulation of biological processes would be a boon for medical research, but simulating even the simplest cell at that resolution is so computationally demanding that it's infeasible even for the world's best supercomputers. But the aggregate computing power of the Bitcoin network is orders of magnitudes larger than any supercomputer, and might be able to run such a model in a reasonable amount of time. By adopting that model for in silico cells, a crucial part of medical research is essentially collectivized, and as part of our design, so too are the results of that research. The project bears similarity to Folding@Home and its crypto-based derivatives (e.g. FoldingCoin), but as far as I know none of these projects explicitly distribute ownership of the research that results from the network. There were also some design details that we didn't have time to hash out, and we left open a big question of computational verifiability: given a simulation update from a node, how can you be certain that they actually computed that value rather than returned some random value? (Golem has this problem too, the difficulty of which is discussed a bit here).

cell.farm
cell.farm

This time around, rather than a project about medical research abstractly, we focused specifically on the pharmaceutical industry, the 1.1 trillion dollar business lying at the nexus of intellectual property law, predatory business practices, and the devaluing of human life.

The pharmaceutical industry

“Is curing patients a sustainable business model?” Goldman Sachs analysts ask
“Is curing patients a sustainable business model?” Goldman Sachs analysts ask

(For background I'm going to lean heavily on the "Pill of Sale" episode of the Ashes Ashes podcast which goes into more detail about the pharmaceutical industry -- definitely worth a listen.)

Most Americans are familiar with exorbitantly-priced drugs -- if not directly than via one of the many horrifying stories of people crowdfunding their continued existence or flying elsewhere to access more reasonable prices. A hepatitis C cure from Gilead, Solvadi, costs $84,000 for a 12-week course and is the subject of a recent Goldman Sachs report. The report describes cures as effective as Solvadi (up to 97%) as bad for business since you cure yourself out of a market. Even something as common insulin can cost a significant portion of income -- to the point where people die from needing to ration it.

This hostile environment is thinly justified with rhetoric around drug development costs and enforced through the patent law system, all under the implicit, sometimes explicit, assumption that it is necessary for drug companies to make a profit on their drugs. Patents provide exclusive rights for a company to sell a particular drug; this temporary monopoly essentially gives them carte blanche to set whatever price they want so that they recoup the drug development costs, so the story goes. These patents last 20 years and can basically be extended by "exclusivity" periods which add up to another 7 years. A drug may take 10-15 years to develop, leaving a window of at least 5 years of exclusive rights to produce and sell it. "Orphan drugs", drugs that treat rare conditions, may have longer monopolies to compensate for the smaller market size. After this period generics are permitted to enter the market, which drives the cost down, but there are all sorts of tricks available that can prolong this protection period even further, a practice called "evergreening". For example, slightly modifying how the drug is delivered (e.g. by tablet or capsule) can be enough for it to essentially be re-patented.

(It's worth noting that prices can be high even for generics. For example, epinephrine -- commonly known as an EpiPen, essential for severe allergic reactions -- can be bought for about 0.10-0.95USD outside the US, whereas generics in the US can cost about $70.)

Drug development pipeline. From: Pharmaceutical Research and Manufacturers of America, Drug Discovery and Development: Understanding the R&D Process, www.innovation.org.
Drug development pipeline. From: Pharmaceutical Research and Manufacturers of America, Drug Discovery and Development: Understanding the R&D Process, www.innovation.org.

Drug development is expensive, averaging at over $2.5 billion per drug, and that's only counting for those that gain FDA approval. However, these exclusivity rights are not merely used to recapture R&D costs, as is often said, but instead to flagrantly gouge prices such that the pharmaceutical industry is tied with banking for the largest profit margins of any industry (as high as 43% in the case of Pfizer).

The narrative around high drug development costs also takes for granted that pharmaceutical companies are the ones bearing all of these costs. A considerable amount of the basic research that is foundational to drug development is funded publicly; the linked study found that public funding contributed to every drug that received FDA approval from 2010-2016. The amount of funding is estimated to be over $100 billion.

It used to be that inventions resulting from federal funding remained under federal ownership, but the 1980 Bayh–Dole Act offered businesses and other institutions the option to claim private ownership. The result is the public "paying twice" for these drugs. The Act does preserve "march-in rights" for the government, allowing the government to circumvent the patent and assign licenses independently if the invention is not made "available to the public on reasonable terms", but as of now these rights have never been exercised. In 2016 there was an unsuccessful attempt to use these march-in rights to lower the price of a prostate cancer drug called Xtandi, priced at $129,000/year.

All of this isn't to say that the work of the pharmaceutical industry isn't valuable; drugs are a necessary part of so many peoples' lives. I recently started using sumatriptan to deal with debilitating migraines, and am hugely grateful it exists (and is not ridiculously expensive). It's because pharmaceuticals are such a critical part to life that their development and distribution should not be dictated the values that currently shape it.

One particularly egregious example of this mess is the nightmare scenario of Purdue Pharmaceuticals, owned by the Sackler family (who are also prolific patrons of the arts), producers of OxyContin (accounting for over 80% of their sales last year), basically responsible for the ongoing opioid crisis (affecting at least 2.1 million Americans directly, and many more collaterally), and recently granted a patent for a drug that treats opioid addiction. The patented treatment is a small modification of an existing generic.

The day before our 7x7 presentation a story broke in the Guardian: "Sackler family members face mass litigation and criminal investigations over opioids crisis".

Computational drug discovery

One reason drug development is so difficult is that the space of possible drug compounds is extremely large, estimated to be between 1060 and 1063 compounds. For comparison, there are an estimated 1022 to 1024 stars in the entire universe, and according to this estimate about 1049-1050 atoms making up our entire world.

PubChem's chemical space, from "Exploring Chemical Space for Drug Discovery Using the Chemical Universe Database"
PubChem's chemical space, from "Exploring Chemical Space for Drug Discovery Using the Chemical Universe Database"

Drug development is in large part a search problem, looking to find useful compounds within this massive space. A brute-force search is impossible; even if it took only a couple seconds to examine each possible compound you'd see several deaths of our sun (a lifespan of about 10 billion years) before fully exploring that space.

More effective techniques for searching this space include slightly modifying existing drugs for different therapeutic applications ("me-too" compounds) and literally looking at plants and indigenous medical traditions for leads (this general practice is called "bioprospecting" and this particularly colonialist form is called "biopiracy").

Of course with the proliferation of machine learning there is a big interest in searching this space computationally. Two main categories are virtual screening (looking through known compounds for ones that look promising) and molecular generation (generating completely new compounds that look promising). We focused on molecular generation for reasons described below.

A primary goal for matter.farm is to publicize this work in computational drug discovery and also help researchers use these generated compounds as potential leads for new beneficial drugs. With our system, which is also open source, independent and institutional researchers alike can access automated drug discovery technology and hopefully accelerate the drug development process.

Prior art and public discovery of drugs

(For this section we spoke with a patent lawyer who requested that we note that they are not representing us.)

One crucial criteria for a patent is that the invention must be novel; that is, the invention cannot have already been known to the public. An existing publicly-known instance of an invention is called "prior art" and can invalidate a patent claim. However, sufficient variations to an invention may qualify it as original enough to be patentable (this is the idea behind evergreening, described above).

If a drug is discovered and made public prior to a patent claim on it, it would function as prior art and make that compound un-patentable in its current form. If we were able to generate new molecules that could function as useful drugs, and make public those new molecules, then perhaps we can prevent companies from patenting them and maintaining a temporary monopoly on their distribution.

This is the second goal of matter.farm: develop an automated drug discovery system to find and publish useful drugs so they cannot be patented.

Additional efforts

Other efforts to the address problems with pharmaceutical industry can be found in initiatives like Medicare for All and the proposed Prescription Drug Price Relief Act, and the organizing happening around those. The issues with the pharmaceutical industry are just one piece of a more general hostility in American healthcare.

There is also a burgeoning DIY medicine movement which aim to build alternatives to industrialized medicine, providing autonomy, access, and reliability where those are normally withheld. For example, the artist Ryan Hammond is working on genetically modifying tobacco plants to produce estrogen and testosterone, and the Four Thieves Vinegar Collective (discussed in the Ashes Ashes "Pill of Sale" episode) provides instructions for a DIY EpiPen and a DIY lab ("MicroLab") for synthesizing various pharmaceuticals, including Naloxone and Solvadi.


That's it for the background of the project. The following section describes how the system works in more detail.

The project code is available here.

How it works

The complete matter.farm system involves three components:

  1. A molecular generation model, using a version of the graph variational autoencoder ("JTNN-VAE") described in [2], modified to be conditional ("JTNN-CVAE"). This generates new compounds given a receptor and action, e.g. a nociceptin receptor agonist.
  2. An ATC code prediction model. The Anatomical Therapeutic Chemical (ATC) classification system categorizes compounds based on their therapeutic effects. We use this to estimate what a generated compound might treat.
  3. A retrosynthesis planner. Retrosynthesic analysis is the process of coming up with a plan to synthesis some target compound from an inventory of base compounds (e.g. compounds you can purchase directly from a supplier). This is necessary to meet the enablement requirement of prior art; that is, it's not enough to come up with a new compound, you also need to sufficiently demonstrate how it could be synthesized.

To train these various models we relied on a number of public data sources, including PubChem, UniProtKB, STITCH, BindingDB, ChEMBL, and DrugBank.

Common chemical compound representations, from Prediction methods and databases within chemoinfromatics: emphasis on drugs and drug candidates.
Common chemical compound representations, from Prediction methods and databases within chemoinfromatics: emphasis on drugs and drug candidates.

Chemical compounds can be represented in a number of ways, e.g. as a 2D structural diagram or a 3D model. One of the most portable formats is SMILES, which represents a molecule as an ASCII string, and is what we use throughout the project.

Clustering

For training the JTNN-CVAE model we needed to cluster the compounds in a meaningful way. At first we didn't look at receptors and actions but rather tried to leverage the vast published chemical research literature (in PubMed) and USPTO patents.

For the first attempt we tried learning word2vec embeddings from PubMed article titles and abstracts, then representing documents using TF-IDF as described in the WISDM algorithm, and finally clustering using DBSCAN or OPTICS. This ended up being way too slow, memory intensive, and limited in what clustering algorithms we could try.

The second attempt involved generating a compound graph such that an edge exists between compound A and B if they appear in an article or patent together. So instead of linking compounds based on the content of the articles they're mentioned in, they're linked solely on the virtue of being mentioned together in an article, under the assumption that this indicates some meaningful similarity. Then we ran a label propagation community detection algorithm to identify clusters within the graph. The graph was fairly sparse however and in the end looking at the connected components seemed to be enough. There were still some limitations with speed and memory that led us to abandon that approach.

Finally we decided to cluster based on receptors compounds were known to interact with. This reduced the amount of compounds we were able to look at (since the compounds for which receptor interactions are known are much less than the total of all known compounds), but the data was richer and more explicit than co-mentions in text documents. A drug's effects are determined by the receptor it targets and how it interacts with that receptor (does it activate it, does it block it, etc). For instance, OxyContin is a mu-type and kappa-type opioid receptor agonist (it activates them), and Naloxone (used to treat opioid overdose) is a mu-type and kappa-type opioid receptor antagonist (it blocks them).

G-Protein Coupled Receptor, from Random42
G-Protein Coupled Receptor, from Random42

The ChEMBL data has information about both receptors and the type of interaction (agonist, antagonist, etc), whereas BindingDB has information about only the receptors but for more compounds. So we used a two-pass approach: for the first pass, we create initial clusters based on receptor-action types and on the second-pass we augment these clusters with the BindingDB compounds, assigning them to the cluster that matches their target receptor and has the highest fingerprint similarity to the cluster's current members. This resulted in 461 receptor-action clusters.

Molecular generation model

The JTNN-VAE model described in [2] is a variational autoencoder that handles molecular graphs. A variational autoencoder is a generative model that learns how to compress ("encode") data in such a way that it can be reliable decompressed ("decoded"). It accomplishes this by learning an underlying probability distribution that describes the data. Once the model has learned this distribution you can sample it to generate new data that looks like the old data. This post provides an overview on variational autoencoders.

We modified the model to be conditional, allowing us to sample from the learned probability distribution conditioned on the cluster (the receptor-action) we want to generate new compounds for.

ATC code prediction model

The ATC code prediction model is a straightforward multiclass neural network. Though ATC code prediction is technically a multilabel problem (a compound may have more than one ATC code), most compounds had only one code, so we treated it as a multiclass problem. ATC codes have 5 levels, from low detail to high detail; we predict level 3 codes ("pharmacological/therapeutic/chemical subgroup"), with one class for each level 3 code. We use 2048-bit Morgan fingerprints as representations for the compounds and achieved about 80% accuracy with this naive approach -- not ideal, but fine for our purposes and time constraints.

Retrosynthesis planner

Here's where it came down to the wire. We attempted to implement the model ("3N-MCTS") described in [7], which involves Monte Carlo Tree Search (MCTS) and three policy networks. The policy networks predict what reaction rules might apply to a given compound. The paper's model is trained on Reaxys data, which is way too expensive for us, so we used a smaller dataset extracted from USPTO patents (from [26]). Our implementation is basically complete but we didn't have enough time to train the models.

In an 11th-hour Hail Mary we used MIT's ASKCOS system which got us mostly partial synthesis plans. At some point we should revisit the 3N-MCTS system to see if we can get that working.

Sampling and filtering

Once all the models were ready we sampled 100 new compounds for each of the 461 receptor-action clusters. Then we filtered down to valid compounds (with no charge) and to those not present in PubChem's collection of 96 million compounds. We ended up whittling the set down to about 15,000 new compounds for which we then predicted ATC codes and generated synthesis plans.

Future work

Time was fairly tight for the project so we didn't get to tune or train the system as much as we wanted to. And it would have been nice to try to experiment with more substantial modifications of the models we used. But for me this project was a wonderful learning experience and renewed an interest in chemistry. I want to spend some more time in this area, especially in materials science because of its relevance to lower-impact technologies, such as this cooling material.

References

  1. Botev, Viktor, Kaloyan Marinov, and Florian Schäfer. "Word importance-based similarity of documents metric (WISDM): Fast and scalable document similarity metric for analysis of scientific documents." Proceedings of the 6th International Workshop on Mining Scientific Publications. ACM, 2017
  2. Jin, Wengong, Regina Barzilay, and Tommi Jaakkola. "Junction Tree Variational Autoencoder for Molecular Graph Generation." arXiv preprint arXiv:1802.04364 (2018).
  3. Kusner, Matt J., Brooks Paige, and José Miguel Hernández-Lobato. "Grammar variational autoencoder." arXiv preprint arXiv:1703.01925 (2017).
  4. Goh, Garrett B., Nathan O. Hodas, and Abhinav Vishnu. "Deep learning for computational chemistry." Journal of computational chemistry 38.16 (2017): 1291-1307.
  5. Yang, Xiufeng, et al. "ChemTS: an efficient python library for de novo molecular generation." Science and technology of advanced materials 18.1 (2017): 972-976.
  6. Liu, Yue, et al. "Materials discovery and design using machine learning." Journal of Materiomics 3.3 (2017): 159-177.
  7. Segler, Marwin HS, Mike Preuss, and Mark P. Waller. "Planning chemical syntheses with deep neural networks and symbolic AI." Nature 555.7698 (2018): 604.
  8. Kim, Edward, et al. "Virtual screening of inorganic materials synthesis parameters with deep learning." npj Computational Materials 3.1 (2017): 53.
  9. Josse, Julie, Jérome Pagès, and François Husson. "Testing the significance of the RV coefficient." Computational Statistics & Data Analysis 53.1 (2008): 82-91.
  10. Cordasco, Gennaro, and Luisa Gargano. "Community detection via semi-synchronous label propagation algorithms." Business Applications of Social Network Analysis (BASNA), 2010 IEEE International Workshop on. IEEE, 2010.
  11. Community Detection in Python
  12. Jin, Wengong, Regina Barzilay, and Tommi Jaakkola. "Junction Tree Variational Autoencoder for Molecular Graph Generation." arXiv preprint arXiv:1802.04364 (2018).
  13. What are the differences between community detection algorithms in igraph?
  14. Summary of community detection algorithms in igraph 0.6
  15. Wang, Yong-Cui, et al. "Network predicting drug’s anatomical therapeutic chemical code." Bioinformatics 29.10 (2013): 1317-1324.
  16. Liu, Zhongyang, et al. "Similarity-based prediction for Anatomical Therapeutic Chemical classification of drugs by integrating multiple data sources." Bioinformatics 31.11 (2015): 1788-1795.
  17. Cheng, Xiang, et al. "iATC-mHyb: a hybrid multi-label classifier for predicting the classification of anatomical therapeutic chemicals." Oncotarget 8.35 (2017): 58494.
  18. Szklarczyk D, Santos A, von Mering C, Jensen LJ, Bork P, Kuhn M. STITCH 5: augmenting protein-chemical interaction networks with tissue and affinity data. Nucleic Acids Res. 2016 Jan 4;44(D1):D380-4.
  19. Wishart DS, Feunang YD, Guo AC, Lo EJ, Marcu A, Grant JR, Sajed T, Johnson D, Li C, Sayeeda Z, Assempour N, Iynkkaran I, Liu Y, Maciejewski A, Gale N, Wilson A, Chin L, Cummings R, Le D, Pon A, Knox C, Wilson M. DrugBank 5.0: a major update to the DrugBank database for 2018. Nucleic Acids Res. 2017 Nov 8. doi: 10.1093/nar/gkx1037.
  20. Kim S, Thiessen PA, Bolton EE, Chen J, Fu G, Gindulyte A, Han L, He J, He S, Shoemaker BA, Wang J, Yu B, Zhang J, Bryant SH. PubChem Substance and Compound databases. Nucleic Acids Res. 2016 Jan 4; 44(D1):D1202-13. Epub 2015 Sep 22 [PubMed PMID: 26400175] doi: 10.1093/nar/gkv951.
  21. Gilson, Michael K., et al. "BindingDB in 2015: a public database for medicinal chemistry, computational chemistry and systems pharmacology." Nucleic acids research 44.D1 (2015): D1045-D1053.
  22. The UniProt Consortium. UniProt: the universal protein knowledgebase. Nucleic Acids Res. 45: D158-D169 (2017)
  23. Gaulton A, Hersey A, Nowotka M, Bento AP, Chambers J, Mendez D, Mutowo P, Atkinson F, Bellis LJ, Cibrián-Uhalte E,
    Davies M, Dedman N, Karlsson A, Magariños MP, Overington JP, Papadatos G, Smit I, Leach AR. (2017)
    'The ChEMBL database in 2017.' Nucleic Acids Res., 45(D1) D945-D954.
  24. Papadatos, George, et al. "SureChEMBL: a large-scale, chemically annotated patent document database." Nucleic acids research 44.D1 (2015): D1220-D1228.
  25. Lowe, Daniel Mark. Extraction of chemical structures and reactions from the literature. Diss. University of Cambridge, 2012.
  26. Lowe, Daniel (2017): Chemical reactions from US patents (1976-Sep2016). figshare. Fileset. CC0 License.
  27. Liu, Bowen, et al. "Retrosynthetic reaction prediction using neural sequence-to-sequence models." ACS central science 3.10 (2017): 1103-1113.
  28. Klucznik, Tomasz, et al. "Efficient syntheses of diverse, medicinally relevant targets planned by computer and executed in the laboratory." Chem 4.3 (2018): 522-532.
  29. Segler, Marwin HS, and Mark P. Waller. "Neural‐Symbolic Machine Learning for Retrosynthesis and Reaction Prediction." Chemistry–A European Journal 23.25 (2017): 5966-5971.
  30. Law, James, et al. "Route designer: a retrosynthetic analysis tool utilizing automated retrosynthetic rule generation." Journal of chemical information and modeling 49.3 (2009): 593-602.
  31. Schwaller, Philippe, et al. "“Found in Translation”: predicting outcomes of complex organic chemistry reactions using neural sequence-to-sequence models." Chemical science 9.28 (2018): 6091-6098.
  32. Kipf, Thomas N., and Max Welling. "Semi-supervised classification with graph convolutional networks." arXiv preprint arXiv:1609.02907 (2016).
  33. Yang, Zhilin, William W. Cohen, and Ruslan Salakhutdinov. "Revisiting semi-supervised learning with graph embeddings." arXiv preprint arXiv:1603.08861 (2016).
  34. Kipf, Thomas, et al. "Neural relational inference for interacting systems." arXiv preprint arXiv:1802.04687 (2018).
  35. Wei, Jennifer N., David Duvenaud, and Alán Aspuru-Guzik. "Neural networks for the prediction of organic chemistry reactions." ACS central science 2.10 (2016): 725-732.
  36. Coley, Connor W., et al. "Prediction of organic reaction outcomes using machine learning." ACS central science 3.5 (2017): 434-443.
  37. Gupta, Anvita. "Predicting Chemical Reaction Type and Reaction Products with Recurrent Neural Networks."
  38. Plehiers, Pieter P., et al. "Automated reaction database and reaction network analysis: extraction of reaction templates using cheminformatics." Journal of cheminformatics 10.1 (2018): 11.
  39. MIT's ASKCOS ("Automated System for Knowledge-based Continuous Organic Synthesis")

Culture: A Social Network Simulator

This is a proposal for Culture, a social network simulator designed and developed to teach students about bot development.

This proposal was originally developed for a class on "news bots" I was scheduled to teach in the fall of 2017 (I ended up having a conflict and was unable to teach it). I wanted students to not only explore the impact of bots from a theory perspective, but also engage hands-on to see just how radically influential these bots are on social media platforms.

And not only bots. Ideally students would take on the role of other actors in social media ecosystems, such as a "traditional" media publication, or as an advertiser, or as a political candidate, or as a influencer, or even as the platform itself, making decisions around aspects such as the newsfeed algorithm.

Unfortunately, there are a number of challenges that make hands-on experience infeasible with live social networks:

  • Ethical concerns. For example, many bots are meant to deceive and manipulate, and we'd be working with real user data.
  • Issues of access. For example, rate-limiting and limited access to data. For privacy reasons APIs generally don't provide sensitive user data to developers, though some such data may be provided to advertisers. And of course, with live social networks there isn't a way for students to change the newsfeed algorithms for the entire network.
  • Limits of reality. For example, a student can't magically become an influencer on Twitter, but in a simulated setting, they can.

There are also some technical obstacles, namely that students taking the class weren't required to have any programming background and I didn't want to spend too much time on introductory programming lessons. Even if students were fairly experienced in programming, working with bots has a lot of advanced challenges, such as dealing with natural language. A simulated social network can be simplified so that these problems are easier to deal with.

This proposal doesn't really have a strong advertising component. After speaking with Irwin Chen about it, I realized it's a pretty big omission. So an updated proposal will include all of that: selling ads, ad targeting, ad exchanges, etc. It's not an area I know well, so I'd have to speak with some people and do some research before sketching that out.

Overview

Culture will be an agent-based simulation of a simple social network modeled off of Twitter. As such, the simulation will consist of the following (each part is elaborated further below):

  • users communicate in a rudimentary language
  • users have different personalities
  • each user will have a feed of messages from people they follow and include promoted/ad messages
    • here students can potentially design their own news feed algorithms and see how that affects individual/public opinion
  • users can message, post media, block, be blocked, be banned, follow, unfollow
  • messages and media influence users
  • the network responds to and affects outside events

Motivation

So much of our exposure to and understanding of the world beyond our immediate experience is mediated by social networks, which is to say by newsfeed algorithms and other individual users of these networks. Students should develop a stronger literacy in these dynamics if they are to adequately navigate this information ecology.

This literacy is best developed by direct interaction with these social networks, such as Twitter or Facebook, rather than through theory alone. However, working directly with these networks may be impractical in that they are massive, closed-source, and limited in access. For instance, due to API limits it is impossible to survey or conduct analyses of the entire population of the network, or to examine in detail its inner operations.

Furthermore, there is no room for counterfactual speculation in these existing social networks. For example, we can't intervene and change the behaviors of all users and see how information propagation changes as a result. This limits the pedagogical value of working directly with, for example, Twitter or Facebook.

A simulated social network addresses these concerns. It can be designed to model the dynamics of its real counterparts, it can be entirely open in that students have access to all the network's data, and its parameters can be tweaked to see how information propagation evolves under different circumstances. Students can develop bots on this network without worrying about API limits, spam protection, and so on. In contrast to the black-box nature of a real social network, a simulated social network functions more like a sandbox.

Agents

The simulated agents are individual users of the social network. They are randomly generated to have particular personalities and interests (see below). Their generation is part of the simulation's initialization. Students do not directly interact with these agents, but can indirectly interact with them via, for example, ads and bots they create (see below).

Language

Dealing with natural language is difficult even for experienced developers and advanced researchers in the topic. Broadly, the problem of natural language in the context of bots can described in two parts: understanding and generation. Both are very difficult and beyond the scope of the courses that this simulation is designed for, which includes introductory classes.

To avoid dealing with natural language, the simulation will consist of a very basic grammar and a relatively small vocabulary which can be easily expanded as needed. Because of its relatively simplicity, the same natural language processing techniques that are currently used for "real" languages can also be applied, but with greater success, and better yet, simpler heuristics will go a longer way. Thus students will not need to have a deep understanding of, for example, word vectors or TF-IDF, but may develop their own simpler techniques that will still be effective.

This simpler language will consist of verbs, nouns, and modifiers (adjectives and adverbs) (collectively, "terms"). Because the courses are assumed to be taught in English, this language will be reflective of English.

These terms are combined into formal propositional statements, e.g. single-payer-healthcare + country -> < freedom, which expresses the opinion that implementing single payer health care in this country will cause (->) a loss (<) of freedom. (This is just a sketch of the syntax; it's subject to change).

This is a bit limiting; there is no room for poetics, for instance, but will provide a strong starting point that can be expanded on later.

The design of this language will involve developing a network of terms (i.e. defining term associations), such that terms represent mixtures of other terms and values in the simulation (e.g. individuality/collectivism, see "Personalities" below). This term association network is opaque to the students; they do not get to see what these terms mean to the agents in the simulation. As with the real world, they must use algorithms or their own intuition from observing the network to determine what language best communicates their messages.

For example: the term "car" may be connected to the terms "individuality" and "freedom" to establish that the term "car" symbolically evokes these two ideas. We could then imagine ads for "cars" appeal more to agents with personalities that align more with those concepts relative to agents who, for example, align more with "collectivity" and "freedom".

Terms also have sentiment valences, e.g. "bad" may have a valence of -0.5 to express a negative opinion, whereas "terrible" may have a stronger valence of -0.8, and so on.

Ideally, this term association network is not objective but rather subjective; i.e. differs depending on the particular agent. For example, the term "freedom" may be associated with different values for one agent than for another. However, it is likely that this will be computationally infeasible (though some kind of heuristics could be developed to simplify it).

This term association network also changes over time as terms are used in slightly different contexts. This provides a way for the meaning of terms to change or be entirely inverted, e.g. a negative term being co-opted as a positive identifying term for a group.

The language is the part of the simulation that will require most care in designing - it needs to represent important aspects of how language is used in social networks (e.g. to express opinion/judgement, to harass/abuse, to make propositional statements, etc).

Personalities

Simulated agents will have could loosely be described as "personalities"; that is, a set of parameters that determines how the agent interacts with others (e.g. aggressiveness/friendliness, within-bubble/outside-bubble, etc) and what their values are (e.g. conservative/progressive, individualist/collectivist, etc). These personalities will be generated randomly, via a Bayes Net (or some similar probabilistic model) that will be editable in some way. A model like a Bayes Net lets us describe assumed relationships between values (e.g. more collectivist agents are more likely to be friendly).

These personalities also determine who agents tend to interact with (under principles of homophily, i.e. like attracts like) and also what kind of messaging resonates with them (e.g. messages about rugged individuality will resonate more with individualist agents).

Messages

"Messages" are the equivalent of Twitter's tweets. Agents compose their own messages based on their personalities and who they are interacting with. Messages may affect an agent's mood and also their personality (see below).

Media

Text is not the only important part of a social network - memes and other media (news stories, videos, etc) form a crucial part of their information flow.

States

Agents' states include their personalities, in addition to other attributes like mood and use frequency (how often they visit the social network) and post frequency (how often they post messages). Mood may affect, for example, how agents interact with other agents (e.g. with more or less hostility). This can be used to model emotion contagion.

Influence

Based on who they interact with and what other messaging they are exposed to (e.g. targeted ads), the personalities (traits/opinions) of an agent may shift over time. Various social phenomena, e.g. bipolarization, can be modeled here.

Events

Social networks are not closed systems; they do not exist in isolation. The "outside" world affects what goes on in network, just as what goes in the network can spill out and effect the outside world.

Part of the simulation will support external events (also simulated) that affect and can be affected by the social network, such as an election. The outside event(s) affect what are popular topics (i.e. topics that are relevant and agents are more likely to talk about and respond to) and they can be defined to have some relationship to the shape of discourse in the network.

Social Network

In this section, "social network" is used not to refer to the platform itself, but to the actual network of relationships between users (expressed by "following" relationships). Some users may be highly connected (many followers), and students may, for example, as part of their strategy (whether for ads or opinion influence) try to target these opinion leaders.

Visualization

It will likely be too computationally taxing to display all activity on the social network, but students will have access to various views that provide summaries (i.e. mean sentiment towards some topic, number of users talking about a topic, etc). Ideally an API can be provided like a real social network, so that students can build their own visualizations as part of their bot development process, but this may be limited by the size of the simulation.

Bot API

A simple API will be provided for students to develop their own bots that interact with this network. These bots can follow, be followed, message, etc like agents can and will be the primary way students interact with the social network.

What distinguishes bots from simulated agents is that bots are designed and controlled by students, whereas the simulated agents represent "real" users of the network.

Learning Objectives

The network functions as a simplified social landscape for students to understand how ads, bots, and news feed algorithms affect opinion, trends, and discussion on a social network, and how that links up with broader spheres of discourse outside of the network. Some students may, for example, design bots that influence opinion in a certain direction, while others may design bots to influence opinion in a different direction, while still others may design bots that root out these interfering bots. Depending on how the network is designed, some students can be the managers of the social network platform.

The goal is for students to develop a comprehensive mental model about the dynamics of social media and communication in the internet age, to peek "behind the curtain" and develop a critical perspective when using social media and reading the news (i.e. develop social media literacy).

Extensions

In theory this simulated social network can be extended with features that could be present on any social network, such as anonymous accounts, different kinds of blocking and muting functionality, and so on. Thus it can also be a place where students can experiment with new features to see how that affects dynamics on the network.

Variants

Ideally the simulator accommodates students who are comfortable with programming and those who aren't.

For students who aren't, bot templates could be provided which require little to no programming experience, or another layer can be developed where they "purchase" different bot, marketing, and so on services that run automatically.

If there are multiple classes going on, they can all work from the same simulation and take on different roles. If one class is focused on advertising, they can take on roles of the advertising ecosystem, while in another class perhaps they collectively take on the role of the platform. The potential for cross-class interactivity is exciting.


7x7 Cutting Room Floor

I was fortunate enough to participate in this year's edition of Rhizome's Seven on Seven with Sean Raspet. The event pairs an artist and a technologist and gives some limited time for the pair to come up with and implement a concept or project. In previous editions pairs only had a day or so; this time we had about a month.

It was enough time to churn through several ideas that never made it to the final presentation. We landed on producing a white paper proposing leveraging blockchain-based distributed computing to collectively simulate a complete human cell at the atomic level, starting with something relatively simple like a red blood cell. A human cell might have hundreds of trillions of atoms and so simulating one at the atomic resolution is basically infeasible with existing computational resources. But it is more feasible now than it was maybe a decade ago.

Through our research we came across some staggering statistics about the computing power of the Bitcoin network, namely that it is estimated to have an aggregate computing power of 80.7 zettaFLOPS (80.7 million petaFLOPS) as of May 2018. The world's reigning supercomputer, the Sunway TaihuLight, has a theoretical peak of 125 petaFLOPS. The Folding@Home network, which enables people to donate spare computing power for protein folding simulations, had an aggregate power of about 100 petaFLOPS in January 2018. Not bad for a volunteer distributed network, but still far off from the Bitcoin network. There are more details in the white paper, but those numbers stuck out.

Anyways, we went through a few ideas before we landed on this white paper. Our first focus was on the phosphorus commodity market in relationship to "peak phosphorus". This was something Sean had been researching for some time now, and for the past few months I've been poking around the agri-tech scene, so I was naturally drawn to it as a topic. The gist is that phosphorus is a mineral crucial to agriculture, a key component in fertilizers (along with nitrogen and potassium; the history of nitrogen fertilizer is very interesting and troubling one), and is basically a non-renewable resource (some can be recovered from waste but I'm not sure what percentage of it is recoverable). At some point in the relatively near future phosphorus extraction may become too expensive or difficult and that could lead to some serious food security crises. So we were thinking of various ways to represent this issue. Here are a few ideas we played around with.

Global phosphorus simulation

Phosphorus, like any resource-extractive industry, is global. We wanted to be able to convey geopolitical issues like Morocco's occupation of Western Sahara, which is where Morocco mines its phosphorus. The most straightforward way to do something like that is a 4X-style global simulation, so we played around with that first.

I designed a little framework for laying out a hex-based map (similar to the cartog library I created for my Simulation & Cybernetics class, but I wanted to support 3D):

3D hex-based maps
3D hex-based maps

That's about as far as we got in terms of implementation. But the general idea was that we'd model the dynamics of the global phosphorus market, with some shocks and random events, and projections of changes in relevant indicators like growth rates, meat consumption rates, and so on. And somehow you'd see these effects on this map and through changes in the price of commodity phosphorus.

I didn't want this hex map to be the only "output" of the simulation. We wanted to show that the macro-level dynamics of the phosphorus market are intimately connected to the health of individual plants, and so I wanted to setup an automated growing system as a more material visualization. The system would be hydroponic or aeroponic, with a phosphorus nutrient pump that releases more or less phosphorus depending on its simulated price. As peak phosphorus approaches, the plant's health starts to deteriorate as it manifests symptoms of phosphorus deficiency. There were a few issues here, namely that 1) it's a pretty big task to set up such a growing system, and 2) the changes in the plant's health would happen over long time scales relative to the simulation (e.g. one simulation year might run in one real minute, and the impacts on the plant's health might not be visible for a few real days).

Phosphorus deficiency in corn
Phosphorus deficiency in corn

A build on this idea we considered is that we'd reserve some set amount of funds for the plant, and it would actually have to "purchase" phosphorus from the nutrient reservoir on its own.

Commodity traders vs food consumers

For awhile I've wanted to make an asymmetric game which consists of two separate games that are at first glance unrelated. For example, on one side of the room is a relatively innocuous-looking life simulator game where you have to e.g. buy a house and care for your family. On the other side of the room is a stock market game where you just try to earn the highest return on your investments. What isn't apparent at first is that the actions of the player in the stock market game directly affect how difficult the life-simulator game is, for example, by triggering financial crises or affecting house prices.

We briefly considered doing something along these lines. The idea was that when we presented, we'd direct audience members to a website where they could join our phosphorus game. Some audience members would be redirected to the "commodity trader" version of the game, while others would instead be redirected to the "food consumer" version.

The commodity trader game is basically same as the stock market game, except just for phosphorus trading.

Commodity trader interface
Commodity trader interface

The food consumer game is built around a "basket", like a simplified version of a consumer price index focused on products especially affected by phosphorus prices. As a player you'd have some nutritional requirements to meet or some other purchasing obligations and some weekly budget with which to buy food. We didn't really get far enough to thoroughly think through the mechanics.

Food consumer interface
Food consumer interface

I did have a really fun time modeling the food:

Food models
Food models

Plant care Tamagotchi

Riffing off the plant-as-visualization idea, we also toyed around with the idea of some kind of plant-tamagotchi. You'd have to manage its water and phosphorus needs by doing some sort of trading or other gameplay. I can't really remember how far we got with the design.

Plant care
Plant care

I did enjoy making this wilting animation though:

Plant wilting
Plant wilting

Physics-based food thing

I honestly can't remember what the concept was for this. The most I can recall is that we discussed a system where you could rapidly click on some food or raw material objects to create derivative objects (such as beef and milk from a cow) and that somehow we'd connect that to the relative use of phosphorus in these products. For example, a cow requires a lot of feed which requires a lot of phosphorus, which results in a less efficient phosphorus-to-calorie ratio than if you had just eaten the feed grains yourself. I think I was really just excited about making something physics-based.

Picking up objects
Picking up objects
Tapping on objects for derivatives
Tapping on objects for derivatives
Bouncing around
Bouncing around

I'll definitely use this again for a different project.


The Infinite Card Game

  • 02.28.2018
  • etc

Kira and I were in Australia most of last month, and near where we were staying in Melbourne was a game shop. We had a free Friday night so I stopped by for my first Magic: The Gathering (MTG) draft event, and it got me thinking about designing card game systems.

MTG is a collectible card game with a great deal of strategic depth. Games with large state spaces like Chess and more recently Go have been more-or-less "solved"1; The state space of MTG is certainly orders of magnitude larger than Chess and Go, given the massive back catalog of cards (going back to 1993!)2 and the ever-growing number of interactions between them. Though the state space of Starcraft is likely larger (and people are working on "solving" it), to my knowledge MTG has not yet been solved in this sense.

For those unfamiliar with MTG, it's played between two or more players and involved constructing a deck of cards around a particular strategy. Some strategies may emphasize fast, aggressive plays ("aggro") which, if failing to win quickly, lose steam in longer matches. Others may focus on slowing opponents down by stopping plays short or making actions more expensive ("control"). And there are other strategies still.

MTG has a variety of game formats which govern how decks are constructed and can affect other game rules. These formats are broadly divided into Constructed and Limited formats. Constructed formats are where players carefully design and assemble their decks in advance. This gives plenty of space for creative, expressive strategies since players have a large pool of cards to select from. In contrast, Limited formats mean that players are given a small amount of random cards drawn from a set of cards and need to assemble a deck on-the-spot (a process called "drafting").

I've mostly played Constructed formats, but now that I've tried Limited a bit more I'm coming to prefer the randomness and uncertainty of Limited formats. In Limited you have to think on your feet more, design your deck more delicately (you aren't sure what to expect from your opponents), and work within a tighter set of constraints. It makes for more challenging and exciting games.

The problem with Limited is that each set has roughly 200-300 cards. After a few games you'll be familiar with every card and players have learned the strategies that work best within that set. Games start to get formulaic and stale. It loses that sense of uncertainty that makes Limited exciting in the first place. It isn't until the next set is released, with new cards and abilities, that things are interesting again.

These sets are carefully designed such that the cards have enough variation to keep things interesting, but not so much that they're totally incoherent (Mark Rosewater, the lead designer of MTG, has a great podcast delving into this design process). And they are meticulously balanced so no strategy is strictly better than any others.

That being said...I wonder if there's a way to design an infinite set, i.e. a dynamic and self-adjusting process which outputs a stream of cards to draw from for a never-ending Limited format. Such a system would need some rule scaffolding or framework (doesn't have to be MTG's) from which it can derive new mechanics and costs (some quantifier of their power), and then generate a balance of cards over some probability distribution.

For example, a core mechanic in MTG is that you have creatures that can attack opponents to damage them. Players can use their own creatures to "block" opponents' attacking creatures. These creatures have some cost to play ("cast") them; generally stronger creatures have the drawback of costing more to cast. Sometimes they may have other abilities which make them more or less versatile, which is compensated by a respective increase or decrease in casting cost. Sometimes you have creatures which are disproportionately cheap in casting cost for their strength, but these are rare.

Let's say the game for this infinite draft system has just this simple attacking-creature mechanic. Our creatures have only a strength and a casting cost. Generally, the stronger the creature, the higher it's casting cost. But not always -- on rare occasions we might have a strong creature that's a bit cheaper than normal. Finally, we add the additional constraint that weaker creatures are more likely, so we emphasize strong creatures as a more notable event.

What we're essentially saying is that the casting cost of a creature is dependent on its strength (and vice versa), which we can represent as a simple Bayes net:

G strength strength casting_cost casting_cost strength->casting_cost

When we want to create a new card, we can first sample its strength, then sampling a casting cost depending on the value we sampled for its strength.

We want a creature's strength to be a positive integer, say in the range $[1, 12]$. So we want a discrete probability distribution with finite support. We could use a Beta-binomial distribution, e.g. $\text{BetaBinomial}(\alpha=2, \beta=6, n=12)$, which has the properties we want:

$\text{BetaBinomial}(\alpha=2, \beta=6, n=12)$
$\text{BetaBinomial}(\alpha=2, \beta=6, n=12)$

Here creatures will tend to have a strength somewhere in $[1, 4]$ and very rarely above 6. Then we can do something similar with casting cost, except that it's dependent on the strength.

This is an extremely simple game and so not a very interesting one. We'd want to add in additional abilities that interact in interesting ways. For example, in MTG the "flying" ability makes a creature blockable only by other creatures with flying. So we can add in some small probability of a creature gaining flying, and have that also affect the casting cost's distribution.

MTG's Flying mechanic

A really nice version of this system is one where you can pass in an arbitrary network relating costs and abilities (a more complex example of the one above), and it would output card descriptions in some interchange form (e.g. JSON), and you can use that to print cards with whatever design you wanted.

A few years ago I prototyped a similar system of cost-based card generation, which incorporated different card types beyond creatures (which I called "units"), additional abilities, and procedurally-generated names.

An example generated card
An example generated card

This prototype doesn't incorporate intra-card balance beyond what falls out of cost-balanced cards. The relative effectiveness of various abilities are really hard to objectively quantify, since their costs are really relative to what the dominant strategies are, i.e. the metagame. So ideally this infinite draft system not only generates balanced cards but also tracks how people are playing them to learn which strategies seem over or underpowered, and correspondingly tweaks the costs of abilities related to those strategies.

The generation features I just described are more about balance but another interesting feature would be introducing a balance-drift so that gameplay never stagnates in an equilibrium of strategies. Perhaps once balance is achieved the system can gradually and temporarily bias the game towards different strategies to encourage different kinds of gameplay. That way there'd be an ebb-and-flow that keeps things interesting in the aggregate and subtly changing the overall feeling of the game.

For example, if the system sees that players almost exclusively play low-strength, cheap creatures, and almost no larger creatures, maybe it will start to slightly cheapen the larger creatures so they see more play. That in turn may cause a new strategy to dominate, maybe a slower gameplay with larger creatures, and eventually a different change would be introduced to shock the system to a different strategy equilibrium.

I've given a very hand-wavy outline of this system here and as described it by no means would match MTG's hand-designed depth and complexity. But I do like the idea of a general system where you input some mechanics and it outputs a series of cards to play a game with. You could, perhaps, model many different systems via a card game format this way.



  1. Not solved in the proper sense, but human players are reliably bested by computer players. 

  2. Based on mtgjson.com's data, there are 18,191 unique cards as of Jan 21, 2018. 


Public transit routing

For the transit demand model for the PolicySpace project (see previously) we needed to be able to ingest GTFS (General Transit Feed Specification) data and use it to generate public transit routes. That is, given a start and an end stop, what's the quickest route using only public transit? Ideally we minimize both travel time and number of transfers, or strike some balance between the two.

I found an existing implementation here, based off of the papers Trip-Based Public Transit Routing (Sascha Witt, 2015) and Trip-Based Public Transit Routing Using Condensed Search Trees (Sascha Witt, 2016). The routing worked fine, but processing the Belo Horizonte GTFS data (630 routes with 9,428 stops, including buses and metro) took over 12 hours. The routing itself did not seem to have a significant speed gain from this pre-processing (unfortunately I don't have specific numbers on hand).

Instead I spent the past few days implementing my own public transit routing algorithm. It's a first attempt, and so it's naive/not heavily optimized, but so far seems to work alright. The goals were to significantly reduce pre-processing time, have a relatively low memory footprint, and quickly produce routes.

First, it's worth explaining a bit about how GTFS data is structured and the terminology it uses.

Public transit has a few components:

  • stops: same as the colloquial usage; a 🚏stop is somewhere a bus, train, etc stops to pick up and drop off, described with an id and a latitude and longitude.
  • trips: a 🚌trip is a sequence of stops associated with a set of arrival and departure times. For example, trip 🚌A goes through stop 🚏X at 12:00, then stop 🚏Y at 13:00, then stop 🚏Z at 14:00. Trip 🚌B goes through stop 🚏X at 13:00, then stop 🚏Y at 14:00, then stop 🚏Z at 15:00. Even though they have the same stop sequence (🚏X->🚏Y->🚏Z), they arrive/depart from each stop at different times, so they are distinct trips.
  • route: a route is a collection of trips. A route's trips do not necessarily mean they share the exact same sequence of stops. For example, a route may have a trip that goes 🚏X->🚏Y->🚏Z and another trip that only goes 🚏X->🚏Y. (I didn't use this data and it's confusing because we're using the term "route" to describe something different. This is the last you'll see this particular usage in this post.)
  • service: a service associates routes with days of operation (e.g. weekdays, weekends, off or alternative schedules on holidays, etc).

With these terms defined, we can define a public transit "route" more specifically as "a sequence of trips and their transfer stops". This is a bit more specific than how we describe routes conversationally, which is more like "take the Q line and transfer to the M at Foo Station". Instead, what we're after is more like "take the 12:42 train on the Q line, transfer to the 13:16 M at Foo Station".

GTFS data includes a few files (which are essentially CSVs but for whatever reason use the txt extension); the important ones here are:

  • stop_times.txt: describes stop sequences for trips, as well as arrival and departure times for each trip's stops.
  • stops.txt: describes stop latitudes and longitudes

We use some of the other data files as well, but those are more for juggling stop and trip ids and less core to the routing algorithm.

Transfer network: structure

The general approach is to generate a transfer network which describes at what stops are various trips linked together (i.e. where transfers can be made). The nodes of this network are described as (trip_id, stop_id) tuples. Note that not all stops of a trip will be nodes in the network; only those where transfers are made. Our resulting route only needs to return trips and where to transfer, not the individual stops within trips. That's something we can easily lookup later using the stop_times.txt data if needed.

The edges of this network can describe two different things:

  • inter-trip edges: If an edge connects two nodes which share a trip_id, e.g. (🚌A, 🚏X)->(🚌A, 🚏Y), it indicates in the network structure that these nodes are along the same trip. Here the edge weight indicates travel time between stops 🚏X and 🚏Y for trip 🚌A.
  • transfer edges: If an edge connects two nodes which don't share a trip_id, e.g. (🚌A, 🚏X)->(🚌B, 🚏X) it indicates a transfer between trips, e.g. transfer between trips 🚌A and 🚌B at stop 🚏X, and the edge weight indicates estimated transfer time.

We also need to distinguish between direct and indirect transfers:

  • a direct transfer is one where the two connected trips share a stop. For example, (🚌A, 🚏X)->(🚌B, 🚏X) is a direct transfer because these trips both share the stop 🚏X.
  • an indirect transfer is one where the two connected trips do not share a stop, but there is a reasonable footpath between the stops (i.e. you could walk between stops under some threshold transfer time limit, e.g. 5 minutes). For example (🚌A, 🚏X)->(🚌B, 🚏Y) where stop 🚏Y is around the corner from stop 🚏X. By introducing these additional edges, we potentially find shorter routes or routes where there normally wouldn't be one.

The transfer network is directed graph. This is due to the forward arrow of time. For example, trip 🚌A arrives at stop 🚏X at 12:10 and the 🚌B departs from stop 🚏X at 12:20, which indicates a valid transfer from trip 🚌A to 🚌B. However, I can't make the transfer in reverse; if I'm on trip 🚌B and arrive at 🚏X, trip 🚌A has potentially already departed from 🚏X. If the situation was that trip 🚌A lingers at 🚏X for long enough that the transfer does work in the reverse direction, we'd just have another edge (🚌B, 🚏Y)->(🚌A, 🚏X).

Transfer network: generation

The pre-processing step is almost entirely the generation of this transfer network. Once we have this network it's relatively easy to generate routes with Dijkstra's algorithm.

The generation process is broken into three parts:

  1. Generate direct transfer edges
  2. Generate indirect transfer edges
  3. Generate inter-trip edges

Direct transfer edge generation

We can use the stop_times.txt file to identify direct transfers. We group rows according to stop_id and look at the trip_ids under each group. However, we can't just create edges between all these trips here; we have to filter in two ways:

  1. down to valid transfers (those that obey the arrow of time). As noted before, a trip 🚌A can only transfer to a trip 🚌B at stop 🚏X if trip 🚌A arrives at 🚏X before trip 🚌B departs.
  2. for each equivalent trip, select only the soonest to the incoming trip.

Expanding on 2): we assume that travellers want to make the soonest transfer possible from among a set of equivalent trips. Equivalent trips are those that share the same stop sequence, regardless of specific arrival/departure times. For example: a trip 🚌A with stops 🚏X->🚏Y->🚏Z and a trip 🚌B with stops 🚏X->🚏Y->🚏Z share the same stop sequence, and so are equivalent.

Say we have a trip 🚌C that also stops at stop 🚏X, arriving at 12:10. Trip 🚌A departs from 🚏X at 12:20 and trip 🚌B departs from 🚏X at 12:40. We only create the edge (🚌C, 🚏X)->(A, 🚏X) and not (🚌C, 🚏X)->(🚌B, 🚏X) because we assume no person will wait 20 extra minutes to take 🚌B when 🚌A goes along the exact same stops and departs sooner.

This assumption greatly reduces the number of edges in the network. With the Belo Horizonte data, this assumption cut edges by about 80%, roughly 18.5 million fewer edges!

Indirect transfer edge generation

To generate indirect transfers, we use the following process:

  1. generate a spatial index (k-d tree) of all stops
  2. for each stop 🚏X, find the n closest stops (neighbors)
    1. for each neighbor 🚏Y:
      1. compute estimated walking time between 🚏X and 🚏Y
      2. create edges between trips going through 🚏X and 🚏Y, sticking to the same constraints laid out for direct transfers (valid and soonest equivalent transfers)

The network edge count and processing time vary with n. Increasing n will, of course, increase edge count and processing time. I've been using n=3 but this will likely vary depending on the specifics of the public transit network you're looking at.

Generate inter-trip edges

Finally, we generate edges between nodes that share a trip. We link these nodes in their stop sequence order. For example, for a trip 🚌A with stop sequence 🚏X->🚏Y->🚏Z, where 🚏X and 🚏Z are both transfer stops, we should have a edge (🚌A,🚏X)->(🚌A,🚏Z). (🚌A,🚏Y) isn't in the graph because it isn't a transfer node.

Trip routing

Once the transfer network is ready we can leverage Dijkstra's algorithm to generate the trip routes. However, we want to accept any arbitrary start and end stops, whereas the nodes in our network only represent transfer stops. Not all stops are included. We use the following procedure to accommodate for this:

  1. Given a starting stop 🚏s and departure time 🕒dt, find all trips 🚌T_s that go through 🚏s after 🕒dt.
  2. Given a ending stop 🚏e, find all trips 🚌T_e that go through 🚏e after 🕒dt.
  3. If there are trips in both 🚌T_s and 🚌T_e, that means there are trips from start to finish without any transfers. Return those and finish. Otherwise, continue.
  4. For each potential starting trip 🚌t_s in 🚌T_s, find the transfer stop soonest after 🚏s in 🚌t_s (note that this could be 🚏s itself) to be an entry node into the transfer network. We'll denote these nodes as ⭕N_s. For example, if we have trip 🚌A with stop sequence 🚏X->🚏Y->🚏Z and 🚏X is our starting stop, but not a transfer stop (meaning it's not in our transfer network), and 🚏Y is a transfer stop, we get the node (🚌A, 🚏Y). If 🚏X is a transfer stop, then we just get (🚌A, 🚏X).
  5. For each potential ending trip 🚌t_e in 🚌T_e, find the transfer stops closest to 🚏e in 🚌t_e (which could be 🚏e itself). We'll denote these nodes as ⭕N_e. This is just like step 4 except we go in reverse. For example, if we have trip 🚌A with the stop sequence 🚏X->🚏Y->🚏Z and 🚏Z is our starting stop but not a transfer stop, and 🚏Y is a transfer stop, then we get the node (🚌A, 🚏Y).
  6. Create a dummy start node, ⭕START, and create edges from it to each node in ⭕N_s, where the edge weight is the travel time from stop 🚏s to that node.
  7. Create a dummy end node, ⭕END, and create edges from each node in ⭕N_e to it, where the edge weight is the travel time from the node to stop 🚏e.
  8. Use Dijkstra's algorithm to find the shortest weighted path from ⭕START to ⭕END. This path is our trip route.

Conclusion

On the Belo Horizonte data (630 routes with 9,428 stops) with n=3 for indirect transfer generation we get a graph of 239,667 nodes and 4,879,935 edges. I haven't yet figured out a way to further reduce this edge count.

The transfer network generation takes about 20 minutes on my ThinkPad X260. I'm not certain a direct comparison is appropriate, but this is significantly faster than the 12 hours the other implementation took. The actual route generation doesn't seem any faster or slower than the other implementation.

While the routing is fairly fast, it definitely could be faster, and is not fast enough for direct usage in PolicySpace. We won't be able to compute individual trip routes for every one of Brazil's ~209 million inhabitants, but we were never expecting to either. This trip routing feature will be used by some higher-level interface which will cache routes and batch compute more approximate routes between regions (called "transportation analysis zones" in the transit demand modeling literature) to cut down on processing time. We're still figuring out specifics.

Anyway, this is version 1 of this trip routing method -- I'd appreciate any feedback and suggestions on how to improve the approach. I need to do more testing to get a better sense of the generated routes' quality.

The code for this implementation at time of writing is available here. There are a couple names different in the code than described here, e.g. the transfer network is called a "trip network", but the algorithm itself should be the same.