Research Tools

  • 10.31.2019
  • etc

My fellowship at the Jain Family Institute involves a lot of reading and research across various sources: typically papers in the form of PDFs, articles on my phone browser, and ePub books.

Prior to the fellowship the research I did was spontaneous and infrequent enough that my ad hoc system of typing out highlights by hand into a one or more markdown files and sorting through them manually worked alright. But now that I'm dealing with many, many more pages of research, this system has become kind of unwieldy.

Here I'll go over the tools I've built over the past couple years (most within the past year) to make this process more manageable. They are mostly meant to address a few key pain points:

  • Aggregating highlights from across several platforms (phone, desktop, and eReader) into a central repository
  • Capturing charts and other graphics
  • Tagging and organizing highlights
  • Generating footnotes

These tools are still new—I'm still testing and tweaking them—so they don't quite link up as well as they should. Eventually they'll likely be grouped into a single tool, but that might not be for awhile.


Basically all of my reading happens on either my phone (Android), my laptop, or my eReader (a Kobo). Documents are either web articles (phone and laptop), PDFs (laptop1), or ePubs (eReader). I'll occasionally print out PDFs, especially if I'm having a hard time focusing on the laptop.

The last pieces of the setup are vim, which I use to take notes and save highlights, and Firefox (though everything here would work fine for Chrome/Chromium).


The eReader is the most straightforward. Kobo readers have a built-in function for highlighting and annotating text. I haven't really tested if this works for images though.

The highlighting and annotating functions work smoothly for the proprietary Kobo .kepub format, which books from the Kobo store come in. For .epub books, the highlighting is atrociously slow and finicky. Fortunately, .epub files can be converted to .kepub with kepubify.

Kobo eReaders store highlights and annotations in a SQLite database. I have a script which exports rows from this database to a JSON file, and another script that lets you easily mount and run this script when you plug your Kobo into your laptop.

Web articles

Copy and paste

On the desktop, saving text web articles are also straightforward: copy and paste. When text from a web article is copied, it's saved in your system clipboard as both plain text and as HTML. When pasting into vim, only the plain text is pasted. Sometimes it's helpful for me to have the HTML pasted—to preserve links, for example—but as markdown instead (since my notes are in markdown files).

I have a tool called nom I developed to make managing markdown files easier. One function it has is nom clip, which takes HTML saved in the system clipboard, converts it to markdown, and then prints it out.

In my vim configuration file for markdown files (~/.vim/ftplugin/markdown.vim) I bound <leader>c to call this function, prepend > to each line (which defines a blockquote in markdown), and then output it into vim2:

" easily paste html clipboard content as quoted markdown
function! PasteQuotedHTML()
    augroup AsyncGroup
        autocmd User AsyncRunStop normal P
    augroup END
    call asyncrun#run("!", "", "nom clip | sed 's/^/> /' | xsel -bi")
nnoremap <leader>c :call PasteQuotedHTML()<cr>
Converting clipboard HTML to markdown
Converting clipboard HTML to markdown


Highlighting on phones is less straightforward. I briefly tried copying highlights into an email, sending it to myself, then copying it into a markdown file when I got to my laptop, but this ended up being too complicated and unpleasant.

I wrote a web extension called hili (for Firefox, but should work for Chrome in theory) that streamlines this. With hili, you select text or an image on a webpage, select "Highlight", and then a tag prompt comes up. Enter the tags and save the highlight, and the selected text or image data, along with some metadata about the page (url, title), is sent to a server that appends the data to a JSON file. The goal here was a very fire-and-forget, uninterrupting way of highlighting that is easy to search through later.

hili desktop demo
hili desktop demo

One snag is that this requires a server to communicate with. I run the application on my personal server, but I'm often reading on the subway where there's only an intermittent internet connection. For awhile this meant I'd be reading an article and come across something I'd want to save, but have to wait until the next subway stop for the internet connection to come back. Now hili queues highlights locally (indicated by the yellow box below) until an internet connection is detected, at which point the local highlights are synchronized with the server.

hili mobile demo
hili mobile demo

hili also helped with an anxiety of mine: I'd procrastinate reading articles that sounded especially interesting because I was worried about forgetting its contents. With hili I can save highlights without much thought, and this is less of a worry.


PDF is a horrible format. It's really not meant for reading on computers...but we do it anyways. Copy and pasting from text from a PDF will result in weird line breaks, there are often broken words (hyphenated) because the text is frequently justified, and is not particularly precise in selecting the text you want.

For a long time I'd either 1) copy the PDF text and manually fix line breaks, broken words, and delete extraneous text; or 2) manually type in whatever I wanted to highlight. Even though it probably takes longer, I usually went with the latter—it just felt faster than jumping around correcting little errors.

Fortunately, the copy-and-paste errors are consistent enough that a lot of the cleanup can be automated, which is what this script does. This feels small, but ends up being a huge time saver. Like with HTML pasting, I added a function to my vim configuration (same file as before), bound to <leader>d, which automatically applies this processing to text in the system clipboard:

" easily paste pdf clipboard content as quoted markdown
function! PasteQuotedPDF()
    augroup AsyncGroup
        autocmd User AsyncRunStop normal P
    augroup END
    call asyncrun#run("!", "", "xsel -b | ~/.bin/pdfpaste | sed 's/^/> /' | xsel -bi")
nnoremap <leader>d :call PasteQuotedPDF()<cr>

It's not perfect though. The heuristics used to clean up the text can hit false positives, as in the demo below ("European-derived" is collapsed to "Europeanderived" because the script takes all hyphenations at the end of lines to be word breaks).

Automatically formatting pasted PDF text
Automatically formatting pasted PDF text

Some PDFs are scans of books or articles. These are basically just images, so highlighting text usually doesn't work with them. If you're lucky, the scans will have been OCR'd and there will be selectable text. However, the OCR quality can vary; sometimes you'll copy in PDF text and have to manually correct errors, which can be tedious. Still, better than manually typing in everything.

In any case, what's to be done for graphics in PDFs? I have a simple script for taking screenshots which dumps them into a folder (~/docs/shots/). I added another function to my vim configuration (same file as before) that automatically moves this to a folder called assets (relative to the current markdown file) and then drops in the markdown syntax for the image:

" screenshot, move to assets folder, paste in markdown
nnoremap <leader>s "=system("fpath=$(shot region <bar> tail -n 1); fname=$(basename $fpath); mv $fpath assets/$fname; echo '![](assets/'$fname')'")<CR>P
Capturing a screenshot and auto-pasting markdown
Capturing a screenshot and auto-pasting markdown

And finally, another vim function lets me easily view the image with gx:

" open markdown syntax urls
" open local images with feh
" open local gifs with gifview
function! OpenUrlUnderCursor()
    let l:lnum = line('.')
    let l:line = getline(l:lnum)
    let l:coln = col('.')

    let l:lcol = l:coln
    while l:line[l:lcol] != '(' && l:line[l:lcol] != '<' && l:lcol >= 0
        let l:lcol -= 1

    let l:rcol = l:coln
    while l:line[l:rcol] != ')' && l:line[l:rcol] != '>' && l:rcol <= col("$")-1
        let l:rcol += 1

    let l:obj = l:line[l:lcol + 1: l:rcol - 1]
    let l:url = matchstr(l:obj, '\(http\|https\):\/\/[^ >,;]*')
    let l:img = matchstr(l:obj, '[^<>()]\+\.\(jpg\|jpeg\|png\|gif\)')
    if l:url != ''
        call netrw#BrowseX(l:url, 0)
    elseif l:img != ''
        if matchend(l:img, 'gif') >= 0
            silent exec "!gifview -a '".l:img."'" | redraw!
            silent exec "!feh --scale-down '".l:img."'" | redraw!
        echomsg 'The cursor is not on a link.'
nnoremap gx :call OpenUrlUnderCursor()<cr>
Quickly viewing a saved image
Quickly viewing a saved image

Tagging and sorting

The final piece is the most recent addition: a tool for quickly going through highlights and tagging bits of text. When I'm in the middle of research, I'm using dumping pretty large chunks of text into my notes file, without worrying too much about annotating or tagging it. Early on in the research I'm often not really sure what's relevant, or how a particular highlight fits into the broader topic or questions I'm interested in.

At the end, however, I'm left with hundreds of pages of text chunks that I need to go through and sort. I used to go through all of these highlights and repeat the copy-and-pasting process, just with smaller chunks of text organized into tag groups. It was an extremely slow process.

Now I have grotto3, which renders these notes as HTML, and provides a very quick interface to select and tag text and images. This information is all saved into a CSV file that can easily be processed later.

As a bonus, since the tags and annotations are well-structured, grotto can automatically organize these selections into an outline (well, tag groupings) and generate markdown footnotes.

grotto demo
grotto demo

Future work

Right now these various tools don't really talk to each other, and they dump their data in different locations and in somewhat different formats. In the future these formats would be harmonized and another tool would be built to quickly explore and search through them. Until then, it's a manual process.

I've also been curious to try mind maps as a way to organize research. grotto could help with this by auto-generating a mind map based on tag co-occurrence, but I haven't given it a try yet. Even so, there's something to be said for the tactile experience of organizing information on paper, so an automatic system might not be ideal.

Honorary mention: signal-daemon


A tool that isn't directly related to highlighting or annotating text but is worth mentioning is signal-daemon, which is a very simple script that lets you "text" notes to another number you register with Signal (e.g. a Google Voice number), and then downloads those texts to a markdown file on your computer. I use this to write down random thoughts, in general, but also on whatever topic I'm researching.

  1. I tried reading PDFs on my phone and found it too straining. I did however put together a script that extracts highlights and annotations for MoonReader, available here

  2. This uses the asyncrun plugin. 

  3. There were a lot of challenges setting up the tag highlighting interactions, mostly to do with the DOM model having a tree structure and edge cases involving overlapping or nested tag highlights...happy to go more into it if it's of interest to anyone. 

Getting in front of pharma: Automated public discovery of drug candidates

A couple weeks ago Sean and I were fortunate enough to participate in another edition of Rhizome's 7x7, this time in Beijing in collaboration with the Chinese Central Academy of Fine Arts (CAFA). I was very excited to collaborate with Sean again after our first collaboration in New York at the New Museum, and to have the chance to try something different together.

We thought about revisiting our previous project,, which was a proposal for a cryptocurrency/distributed computing system for which the proof-of-work protocol involved computing simulation updates for an atomic-level model of a human cell (though our proposal initially suggested simulating a ribosome). Such detailed simulation of biological processes would be a boon for medical research, but simulating even the simplest cell at that resolution is so computationally demanding that it's infeasible even for the world's best supercomputers. But the aggregate computing power of the Bitcoin network is orders of magnitudes larger than any supercomputer, and might be able to run such a model in a reasonable amount of time. By adopting that model for in silico cells, a crucial part of medical research is essentially collectivized, and as part of our design, so too are the results of that research. The project bears similarity to Folding@Home and its crypto-based derivatives (e.g. FoldingCoin), but as far as I know none of these projects explicitly distribute ownership of the research that results from the network. There were also some design details that we didn't have time to hash out, and we left open a big question of computational verifiability: given a simulation update from a node, how can you be certain that they actually computed that value rather than returned some random value? (Golem has this problem too, the difficulty of which is discussed a bit here).

This time around, rather than a project about medical research abstractly, we focused specifically on the pharmaceutical industry, the 1.1 trillion dollar business lying at the nexus of intellectual property law, predatory business practices, and the devaluing of human life.

The pharmaceutical industry

“Is curing patients a sustainable business model?” Goldman Sachs analysts ask
“Is curing patients a sustainable business model?” Goldman Sachs analysts ask

(For background I'm going to lean heavily on the "Pill of Sale" episode of the Ashes Ashes podcast which goes into more detail about the pharmaceutical industry — definitely worth a listen.)

Most Americans are familiar with exorbitantly-priced drugs — if not directly than via one of the many horrifying stories of people crowdfunding their continued existence or flying elsewhere to access more reasonable prices. A hepatitis C cure from Gilead, Solvadi, costs $84,000 for a 12-week course and is the subject of a recent Goldman Sachs report. The report describes cures as effective as Solvadi (up to 97%) as bad for business since you cure yourself out of a market. Even something as common insulin can cost a significant portion of income — to the point where people die from needing to ration it.

This hostile environment is thinly justified with rhetoric around drug development costs and enforced through the patent law system, all under the implicit, sometimes explicit, assumption that it is necessary for drug companies to make a profit on their drugs. Patents provide exclusive rights for a company to sell a particular drug; this temporary monopoly essentially gives them carte blanche to set whatever price they want so that they recoup the drug development costs, so the story goes. These patents last 20 years and can basically be extended by "exclusivity" periods which add up to another 7 years. A drug may take 10-15 years to develop, leaving a window of at least 5 years of exclusive rights to produce and sell it. "Orphan drugs", drugs that treat rare conditions, may have longer monopolies to compensate for the smaller market size. After this period generics are permitted to enter the market, which drives the cost down, but there are all sorts of tricks available that can prolong this protection period even further, a practice called "evergreening". For example, slightly modifying how the drug is delivered (e.g. by tablet or capsule) can be enough for it to essentially be re-patented.

(It's worth noting that prices can be high even for generics. For example, epinephrine — commonly known as an EpiPen, essential for severe allergic reactions — can be bought for about 0.10-0.95USD outside the US, whereas generics in the US can cost about $70.)

Drug development pipeline. From: Pharmaceutical Research and Manufacturers of America, Drug Discovery and Development: Understanding the R&D Process,
Drug development pipeline. From: Pharmaceutical Research and Manufacturers of America, Drug Discovery and Development: Understanding the R&D Process,

Drug development is expensive, averaging at over $2.5 billion per drug, and that's only counting for those that gain FDA approval. However, these exclusivity rights are not merely used to recapture R&D costs, as is often said, but instead to flagrantly gouge prices such that the pharmaceutical industry is tied with banking for the largest profit margins of any industry (as high as 43% in the case of Pfizer).

The narrative around high drug development costs also takes for granted that pharmaceutical companies are the ones bearing all of these costs. A considerable amount of the basic research that is foundational to drug development is funded publicly; the linked study found that public funding contributed to every drug that received FDA approval from 2010-2016. The amount of funding is estimated to be over $100 billion.

It used to be that inventions resulting from federal funding remained under federal ownership, but the 1980 Bayh–Dole Act offered businesses and other institutions the option to claim private ownership. The result is the public "paying twice" for these drugs. The Act does preserve "march-in rights" for the government, allowing the government to circumvent the patent and assign licenses independently if the invention is not made "available to the public on reasonable terms", but as of now these rights have never been exercised. In 2016 there was an unsuccessful attempt to use these march-in rights to lower the price of a prostate cancer drug called Xtandi, priced at $129,000/year.

All of this isn't to say that the work of the pharmaceutical industry isn't valuable; drugs are a necessary part of so many peoples' lives. I recently started using sumatriptan to deal with debilitating migraines, and am hugely grateful it exists (and is not ridiculously expensive). It's because pharmaceuticals are such a critical part to life that their development and distribution should not be dictated the values that currently shape it.

One particularly egregious example of this mess is the nightmare scenario of Purdue Pharmaceuticals, owned by the Sackler family (who are also prolific patrons of the arts), producers of OxyContin (accounting for over 80% of their sales last year), basically responsible for the ongoing opioid crisis (affecting at least 2.1 million Americans directly, and many more collaterally), and recently granted a patent for a drug that treats opioid addiction. The patented treatment is a small modification of an existing generic.

The day before our 7x7 presentation a story broke in the Guardian: "Sackler family members face mass litigation and criminal investigations over opioids crisis".

Computational drug discovery

One reason drug development is so difficult is that the space of possible drug compounds is extremely large, estimated to be between 1060 and 1063 compounds. For comparison, there are an estimated 1022 to 1024 stars in the entire universe, and according to this estimate about 1049-1050 atoms making up our entire world.

PubChem's chemical space, from "Exploring Chemical Space for Drug Discovery Using the Chemical Universe Database"
PubChem's chemical space, from "Exploring Chemical Space for Drug Discovery Using the Chemical Universe Database"

Drug development is in large part a search problem, looking to find useful compounds within this massive space. A brute-force search is impossible; even if it took only a couple seconds to examine each possible compound you'd see several deaths of our sun (a lifespan of about 10 billion years) before fully exploring that space.

More effective techniques for searching this space include slightly modifying existing drugs for different therapeutic applications ("me-too" compounds) and literally looking at plants and indigenous medical traditions for leads (this general practice is called "bioprospecting" and this particularly colonialist form is called "biopiracy").

Of course with the proliferation of machine learning there is a big interest in searching this space computationally. Two main categories are virtual screening (looking through known compounds for ones that look promising) and molecular generation (generating completely new compounds that look promising). We focused on molecular generation for reasons described below.

A primary goal for is to publicize this work in computational drug discovery and also help researchers use these generated compounds as potential leads for new beneficial drugs. With our system, which is also open source, independent and institutional researchers alike can access automated drug discovery technology and hopefully accelerate the drug development process.

Prior art and public discovery of drugs

(For this section we spoke with a patent lawyer who requested that we note that they are not representing us.)

One crucial criteria for a patent is that the invention must be novel; that is, the invention cannot have already been known to the public. An existing publicly-known instance of an invention is called "prior art" and can invalidate a patent claim. However, sufficient variations to an invention may qualify it as original enough to be patentable (this is the idea behind evergreening, described above).

If a drug is discovered and made public prior to a patent claim on it, it would function as prior art and make that compound un-patentable in its current form. If we were able to generate new molecules that could function as useful drugs, and make public those new molecules, then perhaps we can prevent companies from patenting them and maintaining a temporary monopoly on their distribution.

This is the second goal of develop an automated drug discovery system to find and publish useful drugs so they cannot be patented.

Additional efforts

Other efforts to the address problems with pharmaceutical industry can be found in initiatives like Medicare for All and the proposed Prescription Drug Price Relief Act, and the organizing happening around those. The issues with the pharmaceutical industry are just one piece of a more general hostility in American healthcare.

There is also a burgeoning DIY medicine movement which aim to build alternatives to industrialized medicine, providing autonomy, access, and reliability where those are normally withheld. For example, the artist Ryan Hammond is working on genetically modifying tobacco plants to produce estrogen and testosterone, and the Four Thieves Vinegar Collective (discussed in the Ashes Ashes "Pill of Sale" episode) provides instructions for a DIY EpiPen and a DIY lab ("MicroLab") for synthesizing various pharmaceuticals, including Naloxone and Solvadi.

That's it for the background of the project. The following section describes how the system works in more detail.

The project code is available here.

How it works

The complete system involves three components:

  1. A molecular generation model, using a version of the graph variational autoencoder ("JTNN-VAE") described in [2], modified to be conditional ("JTNN-CVAE"). This generates new compounds given a receptor and action, e.g. a nociceptin receptor agonist.
  2. An ATC code prediction model. The Anatomical Therapeutic Chemical (ATC) classification system categorizes compounds based on their therapeutic effects. We use this to estimate what a generated compound might treat.
  3. A retrosynthesis planner. Retrosynthesic analysis is the process of coming up with a plan to synthesis some target compound from an inventory of base compounds (e.g. compounds you can purchase directly from a supplier). This is necessary to meet the enablement requirement of prior art; that is, it's not enough to come up with a new compound, you also need to sufficiently demonstrate how it could be synthesized.

To train these various models we relied on a number of public data sources, including PubChem, UniProtKB, STITCH, BindingDB, ChEMBL, and DrugBank.

Common chemical compound representations, from Prediction methods and databases within chemoinfromatics: emphasis on drugs and drug candidates.
Common chemical compound representations, from Prediction methods and databases within chemoinfromatics: emphasis on drugs and drug candidates.

Chemical compounds can be represented in a number of ways, e.g. as a 2D structural diagram or a 3D model. One of the most portable formats is SMILES, which represents a molecule as an ASCII string, and is what we use throughout the project.


For training the JTNN-CVAE model we needed to cluster the compounds in a meaningful way. At first we didn't look at receptors and actions but rather tried to leverage the vast published chemical research literature (in PubMed) and USPTO patents.

For the first attempt we tried learning word2vec embeddings from PubMed article titles and abstracts, then representing documents using TF-IDF as described in the WISDM algorithm, and finally clustering using DBSCAN or OPTICS. This ended up being way too slow, memory intensive, and limited in what clustering algorithms we could try.

The second attempt involved generating a compound graph such that an edge exists between compound A and B if they appear in an article or patent together. So instead of linking compounds based on the content of the articles they're mentioned in, they're linked solely on the virtue of being mentioned together in an article, under the assumption that this indicates some meaningful similarity. Then we ran a label propagation community detection algorithm to identify clusters within the graph. The graph was fairly sparse however and in the end looking at the connected components seemed to be enough. There were still some limitations with speed and memory that led us to abandon that approach.

Finally we decided to cluster based on receptors compounds were known to interact with. This reduced the amount of compounds we were able to look at (since the compounds for which receptor interactions are known are much less than the total of all known compounds), but the data was richer and more explicit than co-mentions in text documents. A drug's effects are determined by the receptor it targets and how it interacts with that receptor (does it activate it, does it block it, etc). For instance, OxyContin is a mu-type and kappa-type opioid receptor agonist (it activates them), and Naloxone (used to treat opioid overdose) is a mu-type and kappa-type opioid receptor antagonist (it blocks them).

G-Protein Coupled Receptor, from Random42
G-Protein Coupled Receptor, from Random42

The ChEMBL data has information about both receptors and the type of interaction (agonist, antagonist, etc), whereas BindingDB has information about only the receptors but for more compounds. So we used a two-pass approach: for the first pass, we create initial clusters based on receptor-action types and on the second-pass we augment these clusters with the BindingDB compounds, assigning them to the cluster that matches their target receptor and has the highest fingerprint similarity to the cluster's current members. This resulted in 461 receptor-action clusters.

Molecular generation model

The JTNN-VAE model described in [2] is a variational autoencoder that handles molecular graphs. A variational autoencoder is a generative model that learns how to compress ("encode") data in such a way that it can be reliable decompressed ("decoded"). It accomplishes this by learning an underlying probability distribution that describes the data. Once the model has learned this distribution you can sample it to generate new data that looks like the old data. This post provides an overview on variational autoencoders.

We modified the model to be conditional, allowing us to sample from the learned probability distribution conditioned on the cluster (the receptor-action) we want to generate new compounds for.

ATC code prediction model

The ATC code prediction model is a straightforward multiclass neural network. Though ATC code prediction is technically a multilabel problem (a compound may have more than one ATC code), most compounds had only one code, so we treated it as a multiclass problem. ATC codes have 5 levels, from low detail to high detail; we predict level 3 codes ("pharmacological/therapeutic/chemical subgroup"), with one class for each level 3 code. We use 2048-bit Morgan fingerprints as representations for the compounds and achieved about 80% accuracy with this naive approach — not ideal, but fine for our purposes and time constraints.

Retrosynthesis planner

Here's where it came down to the wire. We attempted to implement the model ("3N-MCTS") described in [7], which involves Monte Carlo Tree Search (MCTS) and three policy networks. The policy networks predict what reaction rules might apply to a given compound. The paper's model is trained on Reaxys data, which is way too expensive for us, so we used a smaller dataset extracted from USPTO patents (from [26]). Our implementation is basically complete but we didn't have enough time to train the models.

In an 11th-hour Hail Mary we used MIT's ASKCOS system which got us mostly partial synthesis plans. At some point we should revisit the 3N-MCTS system to see if we can get that working.

Sampling and filtering

Once all the models were ready we sampled 100 new compounds for each of the 461 receptor-action clusters. Then we filtered down to valid compounds (with no charge) and to those not present in PubChem's collection of 96 million compounds. We ended up whittling the set down to about 15,000 new compounds for which we then predicted ATC codes and generated synthesis plans.

Future work

Time was fairly tight for the project so we didn't get to tune or train the system as much as we wanted to. And it would have been nice to try to experiment with more substantial modifications of the models we used. But for me this project was a wonderful learning experience and renewed an interest in chemistry. I want to spend some more time in this area, especially in materials science because of its relevance to lower-impact technologies, such as this cooling material.


  1. Botev, Viktor, Kaloyan Marinov, and Florian Schäfer. "Word importance-based similarity of documents metric (WISDM): Fast and scalable document similarity metric for analysis of scientific documents." Proceedings of the 6th International Workshop on Mining Scientific Publications. ACM, 2017
  2. Jin, Wengong, Regina Barzilay, and Tommi Jaakkola. "Junction Tree Variational Autoencoder for Molecular Graph Generation." arXiv preprint arXiv:1802.04364 (2018).
  3. Kusner, Matt J., Brooks Paige, and José Miguel Hernández-Lobato. "Grammar variational autoencoder." arXiv preprint arXiv:1703.01925 (2017).
  4. Goh, Garrett B., Nathan O. Hodas, and Abhinav Vishnu. "Deep learning for computational chemistry." Journal of computational chemistry 38.16 (2017): 1291-1307.
  5. Yang, Xiufeng, et al. "ChemTS: an efficient python library for de novo molecular generation." Science and technology of advanced materials 18.1 (2017): 972-976.
  6. Liu, Yue, et al. "Materials discovery and design using machine learning." Journal of Materiomics 3.3 (2017): 159-177.
  7. Segler, Marwin HS, Mike Preuss, and Mark P. Waller. "Planning chemical syntheses with deep neural networks and symbolic AI." Nature 555.7698 (2018): 604.
  8. Kim, Edward, et al. "Virtual screening of inorganic materials synthesis parameters with deep learning." npj Computational Materials 3.1 (2017): 53.
  9. Josse, Julie, Jérome Pagès, and François Husson. "Testing the significance of the RV coefficient." Computational Statistics & Data Analysis 53.1 (2008): 82-91.
  10. Cordasco, Gennaro, and Luisa Gargano. "Community detection via semi-synchronous label propagation algorithms." Business Applications of Social Network Analysis (BASNA), 2010 IEEE International Workshop on. IEEE, 2010.
  11. Community Detection in Python
  12. Jin, Wengong, Regina Barzilay, and Tommi Jaakkola. "Junction Tree Variational Autoencoder for Molecular Graph Generation." arXiv preprint arXiv:1802.04364 (2018).
  13. What are the differences between community detection algorithms in igraph?
  14. Summary of community detection algorithms in igraph 0.6
  15. Wang, Yong-Cui, et al. "Network predicting drug’s anatomical therapeutic chemical code." Bioinformatics 29.10 (2013): 1317-1324.
  16. Liu, Zhongyang, et al. "Similarity-based prediction for Anatomical Therapeutic Chemical classification of drugs by integrating multiple data sources." Bioinformatics 31.11 (2015): 1788-1795.
  17. Cheng, Xiang, et al. "iATC-mHyb: a hybrid multi-label classifier for predicting the classification of anatomical therapeutic chemicals." Oncotarget 8.35 (2017): 58494.
  18. Szklarczyk D, Santos A, von Mering C, Jensen LJ, Bork P, Kuhn M. STITCH 5: augmenting protein-chemical interaction networks with tissue and affinity data. Nucleic Acids Res. 2016 Jan 4;44(D1):D380-4.
  19. Wishart DS, Feunang YD, Guo AC, Lo EJ, Marcu A, Grant JR, Sajed T, Johnson D, Li C, Sayeeda Z, Assempour N, Iynkkaran I, Liu Y, Maciejewski A, Gale N, Wilson A, Chin L, Cummings R, Le D, Pon A, Knox C, Wilson M. DrugBank 5.0: a major update to the DrugBank database for 2018. Nucleic Acids Res. 2017 Nov 8. doi: 10.1093/nar/gkx1037.
  20. Kim S, Thiessen PA, Bolton EE, Chen J, Fu G, Gindulyte A, Han L, He J, He S, Shoemaker BA, Wang J, Yu B, Zhang J, Bryant SH. PubChem Substance and Compound databases. Nucleic Acids Res. 2016 Jan 4; 44(D1):D1202-13. Epub 2015 Sep 22 [PubMed PMID: 26400175] doi: 10.1093/nar/gkv951.
  21. Gilson, Michael K., et al. "BindingDB in 2015: a public database for medicinal chemistry, computational chemistry and systems pharmacology." Nucleic acids research 44.D1 (2015): D1045-D1053.
  22. The UniProt Consortium. UniProt: the universal protein knowledgebase. Nucleic Acids Res. 45: D158-D169 (2017)
  23. Gaulton A, Hersey A, Nowotka M, Bento AP, Chambers J, Mendez D, Mutowo P, Atkinson F, Bellis LJ, Cibrián-Uhalte E,
    Davies M, Dedman N, Karlsson A, Magariños MP, Overington JP, Papadatos G, Smit I, Leach AR. (2017)
    'The ChEMBL database in 2017.' Nucleic Acids Res., 45(D1) D945-D954.
  24. Papadatos, George, et al. "SureChEMBL: a large-scale, chemically annotated patent document database." Nucleic acids research 44.D1 (2015): D1220-D1228.
  25. Lowe, Daniel Mark. Extraction of chemical structures and reactions from the literature. Diss. University of Cambridge, 2012.
  26. Lowe, Daniel (2017): Chemical reactions from US patents (1976-Sep2016). figshare. Fileset. CC0 License.
  27. Liu, Bowen, et al. "Retrosynthetic reaction prediction using neural sequence-to-sequence models." ACS central science 3.10 (2017): 1103-1113.
  28. Klucznik, Tomasz, et al. "Efficient syntheses of diverse, medicinally relevant targets planned by computer and executed in the laboratory." Chem 4.3 (2018): 522-532.
  29. Segler, Marwin HS, and Mark P. Waller. "Neural‐Symbolic Machine Learning for Retrosynthesis and Reaction Prediction." Chemistry–A European Journal 23.25 (2017): 5966-5971.
  30. Law, James, et al. "Route designer: a retrosynthetic analysis tool utilizing automated retrosynthetic rule generation." Journal of chemical information and modeling 49.3 (2009): 593-602.
  31. Schwaller, Philippe, et al. "“Found in Translation”: predicting outcomes of complex organic chemistry reactions using neural sequence-to-sequence models." Chemical science 9.28 (2018): 6091-6098.
  32. Kipf, Thomas N., and Max Welling. "Semi-supervised classification with graph convolutional networks." arXiv preprint arXiv:1609.02907 (2016).
  33. Yang, Zhilin, William W. Cohen, and Ruslan Salakhutdinov. "Revisiting semi-supervised learning with graph embeddings." arXiv preprint arXiv:1603.08861 (2016).
  34. Kipf, Thomas, et al. "Neural relational inference for interacting systems." arXiv preprint arXiv:1802.04687 (2018).
  35. Wei, Jennifer N., David Duvenaud, and Alán Aspuru-Guzik. "Neural networks for the prediction of organic chemistry reactions." ACS central science 2.10 (2016): 725-732.
  36. Coley, Connor W., et al. "Prediction of organic reaction outcomes using machine learning." ACS central science 3.5 (2017): 434-443.
  37. Gupta, Anvita. "Predicting Chemical Reaction Type and Reaction Products with Recurrent Neural Networks."
  38. Plehiers, Pieter P., et al. "Automated reaction database and reaction network analysis: extraction of reaction templates using cheminformatics." Journal of cheminformatics 10.1 (2018): 11.
  39. MIT's ASKCOS ("Automated System for Knowledge-based Continuous Organic Synthesis")

Culture: A Social Network Simulator

This is a proposal for Culture, a social network simulator designed and developed to teach students about bot development.

This proposal was originally developed for a class on "news bots" I was scheduled to teach in the fall of 2017 (I ended up having a conflict and was unable to teach it). I wanted students to not only explore the impact of bots from a theory perspective, but also engage hands-on to see just how radically influential these bots are on social media platforms.

And not only bots. Ideally students would take on the role of other actors in social media ecosystems, such as a "traditional" media publication, or as an advertiser, or as a political candidate, or as a influencer, or even as the platform itself, making decisions around aspects such as the newsfeed algorithm.

Unfortunately, there are a number of challenges that make hands-on experience infeasible with live social networks:

  • Ethical concerns. For example, many bots are meant to deceive and manipulate, and we'd be working with real user data.
  • Issues of access. For example, rate-limiting and limited access to data. For privacy reasons APIs generally don't provide sensitive user data to developers, though some such data may be provided to advertisers. And of course, with live social networks there isn't a way for students to change the newsfeed algorithms for the entire network.
  • Limits of reality. For example, a student can't magically become an influencer on Twitter, but in a simulated setting, they can.

There are also some technical obstacles, namely that students taking the class weren't required to have any programming background and I didn't want to spend too much time on introductory programming lessons. Even if students were fairly experienced in programming, working with bots has a lot of advanced challenges, such as dealing with natural language. A simulated social network can be simplified so that these problems are easier to deal with.

This proposal doesn't really have a strong advertising component. After speaking with Irwin Chen about it, I realized it's a pretty big omission. So an updated proposal will include all of that: selling ads, ad targeting, ad exchanges, etc. It's not an area I know well, so I'd have to speak with some people and do some research before sketching that out.


Culture will be an agent-based simulation of a simple social network modeled off of Twitter. As such, the simulation will consist of the following (each part is elaborated further below):

  • users communicate in a rudimentary language
  • users have different personalities
  • each user will have a feed of messages from people they follow and include promoted/ad messages
    • here students can potentially design their own news feed algorithms and see how that affects individual/public opinion
  • users can message, post media, block, be blocked, be banned, follow, unfollow
  • messages and media influence users
  • the network responds to and affects outside events


So much of our exposure to and understanding of the world beyond our immediate experience is mediated by social networks, which is to say by newsfeed algorithms and other individual users of these networks. Students should develop a stronger literacy in these dynamics if they are to adequately navigate this information ecology.

This literacy is best developed by direct interaction with these social networks, such as Twitter or Facebook, rather than through theory alone. However, working directly with these networks may be impractical in that they are massive, closed-source, and limited in access. For instance, due to API limits it is impossible to survey or conduct analyses of the entire population of the network, or to examine in detail its inner operations.

Furthermore, there is no room for counterfactual speculation in these existing social networks. For example, we can't intervene and change the behaviors of all users and see how information propagation changes as a result. This limits the pedagogical value of working directly with, for example, Twitter or Facebook.

A simulated social network addresses these concerns. It can be designed to model the dynamics of its real counterparts, it can be entirely open in that students have access to all the network's data, and its parameters can be tweaked to see how information propagation evolves under different circumstances. Students can develop bots on this network without worrying about API limits, spam protection, and so on. In contrast to the black-box nature of a real social network, a simulated social network functions more like a sandbox.


The simulated agents are individual users of the social network. They are randomly generated to have particular personalities and interests (see below). Their generation is part of the simulation's initialization. Students do not directly interact with these agents, but can indirectly interact with them via, for example, ads and bots they create (see below).


Dealing with natural language is difficult even for experienced developers and advanced researchers in the topic. Broadly, the problem of natural language in the context of bots can described in two parts: understanding and generation. Both are very difficult and beyond the scope of the courses that this simulation is designed for, which includes introductory classes.

To avoid dealing with natural language, the simulation will consist of a very basic grammar and a relatively small vocabulary which can be easily expanded as needed. Because of its relatively simplicity, the same natural language processing techniques that are currently used for "real" languages can also be applied, but with greater success, and better yet, simpler heuristics will go a longer way. Thus students will not need to have a deep understanding of, for example, word vectors or TF-IDF, but may develop their own simpler techniques that will still be effective.

This simpler language will consist of verbs, nouns, and modifiers (adjectives and adverbs) (collectively, "terms"). Because the courses are assumed to be taught in English, this language will be reflective of English.

These terms are combined into formal propositional statements, e.g. single-payer-healthcare + country -> < freedom, which expresses the opinion that implementing single payer health care in this country will cause (->) a loss (<) of freedom. (This is just a sketch of the syntax; it's subject to change).

This is a bit limiting; there is no room for poetics, for instance, but will provide a strong starting point that can be expanded on later.

The design of this language will involve developing a network of terms (i.e. defining term associations), such that terms represent mixtures of other terms and values in the simulation (e.g. individuality/collectivism, see "Personalities" below). This term association network is opaque to the students; they do not get to see what these terms mean to the agents in the simulation. As with the real world, they must use algorithms or their own intuition from observing the network to determine what language best communicates their messages.

For example: the term "car" may be connected to the terms "individuality" and "freedom" to establish that the term "car" symbolically evokes these two ideas. We could then imagine ads for "cars" appeal more to agents with personalities that align more with those concepts relative to agents who, for example, align more with "collectivity" and "freedom".

Terms also have sentiment valences, e.g. "bad" may have a valence of -0.5 to express a negative opinion, whereas "terrible" may have a stronger valence of -0.8, and so on.

Ideally, this term association network is not objective but rather subjective; i.e. differs depending on the particular agent. For example, the term "freedom" may be associated with different values for one agent than for another. However, it is likely that this will be computationally infeasible (though some kind of heuristics could be developed to simplify it).

This term association network also changes over time as terms are used in slightly different contexts. This provides a way for the meaning of terms to change or be entirely inverted, e.g. a negative term being co-opted as a positive identifying term for a group.

The language is the part of the simulation that will require most care in designing - it needs to represent important aspects of how language is used in social networks (e.g. to express opinion/judgement, to harass/abuse, to make propositional statements, etc).


Simulated agents will have could loosely be described as "personalities"; that is, a set of parameters that determines how the agent interacts with others (e.g. aggressiveness/friendliness, within-bubble/outside-bubble, etc) and what their values are (e.g. conservative/progressive, individualist/collectivist, etc). These personalities will be generated randomly, via a Bayes Net (or some similar probabilistic model) that will be editable in some way. A model like a Bayes Net lets us describe assumed relationships between values (e.g. more collectivist agents are more likely to be friendly).

These personalities also determine who agents tend to interact with (under principles of homophily, i.e. like attracts like) and also what kind of messaging resonates with them (e.g. messages about rugged individuality will resonate more with individualist agents).


"Messages" are the equivalent of Twitter's tweets. Agents compose their own messages based on their personalities and who they are interacting with. Messages may affect an agent's mood and also their personality (see below).


Text is not the only important part of a social network - memes and other media (news stories, videos, etc) form a crucial part of their information flow.


Agents' states include their personalities, in addition to other attributes like mood and use frequency (how often they visit the social network) and post frequency (how often they post messages). Mood may affect, for example, how agents interact with other agents (e.g. with more or less hostility). This can be used to model emotion contagion.


Based on who they interact with and what other messaging they are exposed to (e.g. targeted ads), the personalities (traits/opinions) of an agent may shift over time. Various social phenomena, e.g. bipolarization, can be modeled here.


Social networks are not closed systems; they do not exist in isolation. The "outside" world affects what goes on in network, just as what goes in the network can spill out and effect the outside world.

Part of the simulation will support external events (also simulated) that affect and can be affected by the social network, such as an election. The outside event(s) affect what are popular topics (i.e. topics that are relevant and agents are more likely to talk about and respond to) and they can be defined to have some relationship to the shape of discourse in the network.

Social Network

In this section, "social network" is used not to refer to the platform itself, but to the actual network of relationships between users (expressed by "following" relationships). Some users may be highly connected (many followers), and students may, for example, as part of their strategy (whether for ads or opinion influence) try to target these opinion leaders.


It will likely be too computationally taxing to display all activity on the social network, but students will have access to various views that provide summaries (i.e. mean sentiment towards some topic, number of users talking about a topic, etc). Ideally an API can be provided like a real social network, so that students can build their own visualizations as part of their bot development process, but this may be limited by the size of the simulation.


A simple API will be provided for students to develop their own bots that interact with this network. These bots can follow, be followed, message, etc like agents can and will be the primary way students interact with the social network.

What distinguishes bots from simulated agents is that bots are designed and controlled by students, whereas the simulated agents represent "real" users of the network.

Learning Objectives

The network functions as a simplified social landscape for students to understand how ads, bots, and news feed algorithms affect opinion, trends, and discussion on a social network, and how that links up with broader spheres of discourse outside of the network. Some students may, for example, design bots that influence opinion in a certain direction, while others may design bots to influence opinion in a different direction, while still others may design bots that root out these interfering bots. Depending on how the network is designed, some students can be the managers of the social network platform.

The goal is for students to develop a comprehensive mental model about the dynamics of social media and communication in the internet age, to peek "behind the curtain" and develop a critical perspective when using social media and reading the news (i.e. develop social media literacy).


In theory this simulated social network can be extended with features that could be present on any social network, such as anonymous accounts, different kinds of blocking and muting functionality, and so on. Thus it can also be a place where students can experiment with new features to see how that affects dynamics on the network.


Ideally the simulator accommodates students who are comfortable with programming and those who aren't.

For students who aren't, bot templates could be provided which require little to no programming experience, or another layer can be developed where they "purchase" different bot, marketing, and so on services that run automatically.

If there are multiple classes going on, they can all work from the same simulation and take on different roles. If one class is focused on advertising, they can take on roles of the advertising ecosystem, while in another class perhaps they collectively take on the role of the platform. The potential for cross-class interactivity is exciting.

7x7 Cutting Room Floor

I was fortunate enough to participate in this year's edition of Rhizome's Seven on Seven with Sean Raspet. The event pairs an artist and a technologist and gives some limited time for the pair to come up with and implement a concept or project. In previous editions pairs only had a day or so; this time we had about a month.

It was enough time to churn through several ideas that never made it to the final presentation. We landed on producing a white paper proposing leveraging blockchain-based distributed computing to collectively simulate a complete human cell at the atomic level, starting with something relatively simple like a red blood cell. A human cell might have hundreds of trillions of atoms and so simulating one at the atomic resolution is basically infeasible with existing computational resources. But it is more feasible now than it was maybe a decade ago.

Through our research we came across some staggering statistics about the computing power of the Bitcoin network, namely that it is estimated to have an aggregate computing power of 80.7 zettaFLOPS (80.7 million petaFLOPS) as of May 2018. The world's reigning supercomputer, the Sunway TaihuLight, has a theoretical peak of 125 petaFLOPS. The Folding@Home network, which enables people to donate spare computing power for protein folding simulations, had an aggregate power of about 100 petaFLOPS in January 2018. Not bad for a volunteer distributed network, but still far off from the Bitcoin network. There are more details in the white paper, but those numbers stuck out.

Anyways, we went through a few ideas before we landed on this white paper. Our first focus was on the phosphorus commodity market in relationship to "peak phosphorus". This was something Sean had been researching for some time now, and for the past few months I've been poking around the agri-tech scene, so I was naturally drawn to it as a topic. The gist is that phosphorus is a mineral crucial to agriculture, a key component in fertilizers (along with nitrogen and potassium; the history of nitrogen fertilizer is very interesting and troubling one), and is basically a non-renewable resource (some can be recovered from waste but I'm not sure what percentage of it is recoverable). At some point in the relatively near future phosphorus extraction may become too expensive or difficult and that could lead to some serious food security crises. So we were thinking of various ways to represent this issue. Here are a few ideas we played around with.

Global phosphorus simulation

Phosphorus, like any resource-extractive industry, is global. We wanted to be able to convey geopolitical issues like Morocco's occupation of Western Sahara, which is where Morocco mines its phosphorus. The most straightforward way to do something like that is a 4X-style global simulation, so we played around with that first.

I designed a little framework for laying out a hex-based map (similar to the cartog library I created for my Simulation & Cybernetics class, but I wanted to support 3D):

3D hex-based maps
3D hex-based maps

That's about as far as we got in terms of implementation. But the general idea was that we'd model the dynamics of the global phosphorus market, with some shocks and random events, and projections of changes in relevant indicators like growth rates, meat consumption rates, and so on. And somehow you'd see these effects on this map and through changes in the price of commodity phosphorus.

I didn't want this hex map to be the only "output" of the simulation. We wanted to show that the macro-level dynamics of the phosphorus market are intimately connected to the health of individual plants, and so I wanted to setup an automated growing system as a more material visualization. The system would be hydroponic or aeroponic, with a phosphorus nutrient pump that releases more or less phosphorus depending on its simulated price. As peak phosphorus approaches, the plant's health starts to deteriorate as it manifests symptoms of phosphorus deficiency. There were a few issues here, namely that 1) it's a pretty big task to set up such a growing system, and 2) the changes in the plant's health would happen over long time scales relative to the simulation (e.g. one simulation year might run in one real minute, and the impacts on the plant's health might not be visible for a few real days).

Phosphorus deficiency in corn
Phosphorus deficiency in corn

A build on this idea we considered is that we'd reserve some set amount of funds for the plant, and it would actually have to "purchase" phosphorus from the nutrient reservoir on its own.

Commodity traders vs food consumers

For awhile I've wanted to make an asymmetric game which consists of two separate games that are at first glance unrelated. For example, on one side of the room is a relatively innocuous-looking life simulator game where you have to e.g. buy a house and care for your family. On the other side of the room is a stock market game where you just try to earn the highest return on your investments. What isn't apparent at first is that the actions of the player in the stock market game directly affect how difficult the life-simulator game is, for example, by triggering financial crises or affecting house prices.

We briefly considered doing something along these lines. The idea was that when we presented, we'd direct audience members to a website where they could join our phosphorus game. Some audience members would be redirected to the "commodity trader" version of the game, while others would instead be redirected to the "food consumer" version.

The commodity trader game is basically same as the stock market game, except just for phosphorus trading.

Commodity trader interface
Commodity trader interface

The food consumer game is built around a "basket", like a simplified version of a consumer price index focused on products especially affected by phosphorus prices. As a player you'd have some nutritional requirements to meet or some other purchasing obligations and some weekly budget with which to buy food. We didn't really get far enough to thoroughly think through the mechanics.

Food consumer interface
Food consumer interface

I did have a really fun time modeling the food:

Food models
Food models

Plant care Tamagotchi

Riffing off the plant-as-visualization idea, we also toyed around with the idea of some kind of plant-tamagotchi. You'd have to manage its water and phosphorus needs by doing some sort of trading or other gameplay. I can't really remember how far we got with the design.

Plant care
Plant care

I did enjoy making this wilting animation though:

Plant wilting
Plant wilting

Physics-based food thing

I honestly can't remember what the concept was for this. The most I can recall is that we discussed a system where you could rapidly click on some food or raw material objects to create derivative objects (such as beef and milk from a cow) and that somehow we'd connect that to the relative use of phosphorus in these products. For example, a cow requires a lot of feed which requires a lot of phosphorus, which results in a less efficient phosphorus-to-calorie ratio than if you had just eaten the feed grains yourself. I think I was really just excited about making something physics-based.

Picking up objects
Picking up objects
Tapping on objects for derivatives
Tapping on objects for derivatives
Bouncing around
Bouncing around

I'll definitely use this again for a different project.

The Infinite Card Game

  • 02.28.2018
  • etc

Kira and I were in Australia most of last month, and near where we were staying in Melbourne was a game shop. We had a free Friday night so I stopped by for my first Magic: The Gathering (MTG) draft event, and it got me thinking about designing card game systems.

MTG is a collectible card game with a great deal of strategic depth. Games with large state spaces like Chess and more recently Go have been more-or-less "solved"1; The state space of MTG is certainly orders of magnitude larger than Chess and Go, given the massive back catalog of cards (going back to 1993!)2 and the ever-growing number of interactions between them. Though the state space of Starcraft is likely larger (and people are working on "solving" it), to my knowledge MTG has not yet been solved in this sense.

For those unfamiliar with MTG, it's played between two or more players and involved constructing a deck of cards around a particular strategy. Some strategies may emphasize fast, aggressive plays ("aggro") which, if failing to win quickly, lose steam in longer matches. Others may focus on slowing opponents down by stopping plays short or making actions more expensive ("control"). And there are other strategies still.

MTG has a variety of game formats which govern how decks are constructed and can affect other game rules. These formats are broadly divided into Constructed and Limited formats. Constructed formats are where players carefully design and assemble their decks in advance. This gives plenty of space for creative, expressive strategies since players have a large pool of cards to select from. In contrast, Limited formats mean that players are given a small amount of random cards drawn from a set of cards and need to assemble a deck on-the-spot (a process called "drafting").

I've mostly played Constructed formats, but now that I've tried Limited a bit more I'm coming to prefer the randomness and uncertainty of Limited formats. In Limited you have to think on your feet more, design your deck more delicately (you aren't sure what to expect from your opponents), and work within a tighter set of constraints. It makes for more challenging and exciting games.

The problem with Limited is that each set has roughly 200-300 cards. After a few games you'll be familiar with every card and players have learned the strategies that work best within that set. Games start to get formulaic and stale. It loses that sense of uncertainty that makes Limited exciting in the first place. It isn't until the next set is released, with new cards and abilities, that things are interesting again.

These sets are carefully designed such that the cards have enough variation to keep things interesting, but not so much that they're totally incoherent (Mark Rosewater, the lead designer of MTG, has a great podcast delving into this design process). And they are meticulously balanced so no strategy is strictly better than any others.

That being said...I wonder if there's a way to design an infinite set, i.e. a dynamic and self-adjusting process which outputs a stream of cards to draw from for a never-ending Limited format. Such a system would need some rule scaffolding or framework (doesn't have to be MTG's) from which it can derive new mechanics and costs (some quantifier of their power), and then generate a balance of cards over some probability distribution.

For example, a core mechanic in MTG is that you have creatures that can attack opponents to damage them. Players can use their own creatures to "block" opponents' attacking creatures. These creatures have some cost to play ("cast") them; generally stronger creatures have the drawback of costing more to cast. Sometimes they may have other abilities which make them more or less versatile, which is compensated by a respective increase or decrease in casting cost. Sometimes you have creatures which are disproportionately cheap in casting cost for their strength, but these are rare.

Let's say the game for this infinite draft system has just this simple attacking-creature mechanic. Our creatures have only a strength and a casting cost. Generally, the stronger the creature, the higher it's casting cost. But not always — on rare occasions we might have a strong creature that's a bit cheaper than normal. Finally, we add the additional constraint that weaker creatures are more likely, so we emphasize strong creatures as a more notable event.

What we're essentially saying is that the casting cost of a creature is dependent on its strength (and vice versa), which we can represent as a simple Bayes net:

G strength strength casting_cost casting_cost strength->casting_cost

When we want to create a new card, we can first sample its strength, then sampling a casting cost depending on the value we sampled for its strength.

We want a creature's strength to be a positive integer, say in the range $[1, 12]$. So we want a discrete probability distribution with finite support. We could use a Beta-binomial distribution, e.g. $\text{BetaBinomial}(\alpha=2, \beta=6, n=12)$, which has the properties we want:

$\text{BetaBinomial}(\alpha=2, \beta=6, n=12)$
$\text{BetaBinomial}(\alpha=2, \beta=6, n=12)$

Here creatures will tend to have a strength somewhere in $[1, 4]$ and very rarely above 6. Then we can do something similar with casting cost, except that it's dependent on the strength.

This is an extremely simple game and so not a very interesting one. We'd want to add in additional abilities that interact in interesting ways. For example, in MTG the "flying" ability makes a creature blockable only by other creatures with flying. So we can add in some small probability of a creature gaining flying, and have that also affect the casting cost's distribution.

MTG's Flying mechanic

A really nice version of this system is one where you can pass in an arbitrary network relating costs and abilities (a more complex example of the one above), and it would output card descriptions in some interchange form (e.g. JSON), and you can use that to print cards with whatever design you wanted.

A few years ago I prototyped a similar system of cost-based card generation, which incorporated different card types beyond creatures (which I called "units"), additional abilities, and procedurally-generated names.

An example generated card
An example generated card

This prototype doesn't incorporate intra-card balance beyond what falls out of cost-balanced cards. The relative effectiveness of various abilities are really hard to objectively quantify, since their costs are really relative to what the dominant strategies are, i.e. the metagame. So ideally this infinite draft system not only generates balanced cards but also tracks how people are playing them to learn which strategies seem over or underpowered, and correspondingly tweaks the costs of abilities related to those strategies.

The generation features I just described are more about balance but another interesting feature would be introducing a balance-drift so that gameplay never stagnates in an equilibrium of strategies. Perhaps once balance is achieved the system can gradually and temporarily bias the game towards different strategies to encourage different kinds of gameplay. That way there'd be an ebb-and-flow that keeps things interesting in the aggregate and subtly changing the overall feeling of the game.

For example, if the system sees that players almost exclusively play low-strength, cheap creatures, and almost no larger creatures, maybe it will start to slightly cheapen the larger creatures so they see more play. That in turn may cause a new strategy to dominate, maybe a slower gameplay with larger creatures, and eventually a different change would be introduced to shock the system to a different strategy equilibrium.

I've given a very hand-wavy outline of this system here and as described it by no means would match MTG's hand-designed depth and complexity. But I do like the idea of a general system where you input some mechanics and it outputs a series of cards to play a game with. You could, perhaps, model many different systems via a card game format this way.

  1. Not solved in the proper sense, but human players are reliably bested by computer players. 

  2. Based on's data, there are 18,191 unique cards as of Jan 21, 2018.