Simulating a multi-node (Py)Spark cluster using Docker


I'm working on a set of tools for the Coral Project to make building data analysis pipelines easy and, perhaps one day, accessible to even non-technical folks. Part of what will be offered is a way of easily toggling between running pipelines on a in parallel on a local machine or on a distributed computing cluster. That way, the pipelines that a small organization uses for their data can be adapted to a larger organization just by spinning up the setup described below and changing a configuration option.

I wanted to simulate a multi-node cluster for developing these tools, and couldn't find any guides for doing so. So after some research, here is one.

The setup that follows runs all on one machine (remember, it just simulates a multi-node cluster), but it should be easily adaptable to a real multi-node cluster by appropriately changing the IPs that the containers use the communicate.

I have made available a repo with the Dockerfiles and scripts described below.

The Stack

A lot goes into the cluster stack:

  • Spark - used to define tasks
  • Mesos - used for cluster management
  • Zookeeper - used for electing Mesos leaders
  • Hadoop - used for HDFS (Hadoop Distributed File System)
  • Docker - for containerizing the above

The Setup

There will be a client machine (or "control node"), which is the machine we're working from. In this walkthrough, the client machine also functions as the Docker host (where the Docker containers are run).

Docker containers are spun up for each other part of the stack, and they all communicate via their "external" Docker IPs.

Setting up the client

I'm assuming a Linux environment because that's what Docker works best with (on OSX you are probably running it in a Linux VM anyways). The following instructions are for Ubuntu but should be replicable on other distros.

The client needs to have Spark and Mesos installed to properly interact with the cluster.

Spark has precompiled binaries available on their downloads page which are easily installed:

# go to <>
# select and download the version you want
tar -xzvf spark-*.tgz
rm spark-*.tgz
sudo mv spark* /usr/local/share/spark

Add the following to your ~/.bash_profile as well:

export SPARK_HOME=/usr/local/share/spark

# so pyspark can be imported in python

PySpark has one final requirement, the py4j library:

pip install py4j

Mesos does not have any precompiled binaries, so you must compile it yourself:


# sources available at <>
tar -zxf mesos-*.tar.gz
rm mesos-*.tar.gz

# dependencies
sudo apt-get install -y openjdk-7-jdk build-essential python-dev python-boto libcurl4-nss-dev libsasl2-dev maven libapr1-dev libsvn-dev

# by default, this installs to /usr/local
cd mesos*
mkdir build
cd build
sudo make install

Finally, we need to configure Spark to use a Mesos cluster:

cp $SPARK_HOME/conf/ $SPARK_HOME/conf/
echo 'export MESOS_NATIVE_JAVA_LIBRARY=/usr/local/lib/' >> $SPARK_HOME/conf/

That's all for the client.

Setting up the Docker images

Our "cluster" will consist of several Docker containers, with one (or more) for each part of the stack, so we create images for each.


The Zookeeper image is straightforward:

FROM ubuntu:14.04

ENV ZOOKEEPER_PATH /usr/local/share/zookeeper

# update
RUN apt-get update
RUN apt-get upgrade -y

# dependencies
RUN apt-get install -y wget openjdk-7-jre-headless

# zookeeper
RUN wget${ZOOKEEPER_V}/zookeeper-${ZOOKEEPER_V}.tar.gz
RUN tar -zxf zookeeper-*.tar.gz
RUN rm zookeeper-*.tar.gz
RUN mv zookeeper-* $ZOOKEEPER_PATH
RUN mv $ZOOKEEPER_PATH/conf/zoo_sample.cfg $ZOOKEEPER_PATH/conf/zoo.cfg



CMD ["start-foreground"]

A Zookeeper binary is downloaded and installed, then the default config is copied over. We start the Zookeeper service in the foreground so the Docker container does not immediately exit.


The Hadoop image is more involved:

FROM ubuntu:14.04

ENV HADOOP_HOME /usr/local/hadoop
ENV JAVA_HOME /usr/lib/jvm/java-7-openjdk-amd64
ENV HADOOP_TMP /var/hadoop/tmp

# update
RUN apt-get update
RUN apt-get upgrade -y

# dependencies
RUN apt-get install -y openssh-server openjdk-7-jdk wget

# disable ipv6 since hadoop does not support it
RUN echo 'net.ipv6.conf.all.disable_ipv6 = 1' >> /etc/sysctl.conf
RUN echo 'net.ipv6.conf.default.disable_ipv6 = 1' >> /etc/sysctl.conf
RUN echo 'net.ipv6.conf.lo.disable_ipv6 = 1' >> /etc/sysctl.conf

# hadoop
RUN wget${HADOOP_V}/hadoop-${HADOOP_V}.tar.gz
RUN tar -zxf hadoop-*.tar.gz
RUN rm hadoop-*.tar.gz
RUN mv hadoop-* $HADOOP_HOME

# hadoop tmp directory
RUN mkdir -p $HADOOP_TMP
RUN chmod 750 $HADOOP_TMP

# configs
RUN echo "export JAVA_HOME=$JAVA_HOME" >> $HADOOP_HOME/etc/hadoop/
ADD docker/assets/core-site.xml $HADOOP_HOME/etc/hadoop/core-site.xml

# auth
# the provided config saves us from having
# to accept each new host key on connect
RUN ssh-keygen -q -N "" -t rsa -f /root/.ssh/id_rsa
RUN cp /root/.ssh/ /root/.ssh/authorized_keys
ADD docker/assets/ssh_config /root/.ssh/config

# format the hdfs
RUN hdfs namenode -format

ADD docker/assets/start_hadoop start_hadoop

EXPOSE 8020 50010 50020 50070 50075 50090

CMD ["-d"]
ENTRYPOINT ["./start_hadoop"]

It does the following:

  • A Hadoop binary is downloaded and installed
  • IPV6 is disabled because Hadoop does not support it
  • SSH auth is setup because Hadoop uses it for connections
  • Hadoop is configured with the proper Java install

For SSH, a config which frees us from having to manually accept new hosts is copied over:

Host *
    UserKnownHostsFile /dev/null
    StrictHostKeyChecking no

A core-site.xml config file is also added, which includes:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
  <description>A base for other temporary directories.</description>

  <description>The name of the default file system.  A URI whose
  scheme and authority determine the FileSystem implementation.  The
  uri's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class.  The uri's authority is used to
  determine the host, port, etc. for a filesystem.</description>

The important part here is the fs.defaultFS property which describes how others can access the HDFS. Here, the value is localhost, but that is replaced by the start_hadoop script (see below) with the container's "external" IP.

And finally, a start_hadoop script is copied over, which includes:


# get "external" docker ip
HDFS_IP=$(ifconfig eth0 | grep 'inet addr:' | cut -d: -f2 | awk '{print $1}')

# set the proper ip in the HDFS config
sed -i 's/localhost/'${HDFS_IP}'/g' $HADOOP_HOME/etc/hadoop/core-site.xml

/etc/init.d/ssh restart

if [[ $1 == "-d" ]]; then
    while true; do sleep 1000; done

if [[ $1 == "-bash" ]]; then

As mentioned, it replaces the localhost value in the core-site.xml config with the "external" IP so that others can connect to the HDFS.

It also starts the SSH service, then starts the HDFS, and, with the -d flag (which is passed in the above Dockerfile), emulates a foreground service so that the Docker container does not exit.


For the Mesos leader and followers, we first create a base Mesos image and then use that to create the leader and follower images.

The base Mesos image Dockerfile:

FROM ubuntu:14.04

ENV MESOS_V 0.24.0

# update
RUN apt-get update
RUN apt-get upgrade -y

# dependencies
RUN apt-get install -y wget openjdk-7-jdk build-essential python-dev python-boto libcurl4-nss-dev libsasl2-dev maven libapr1-dev libsvn-dev

# mesos
RUN wget${MESOS_V}/mesos-${MESOS_V}.tar.gz
RUN tar -zxf mesos-*.tar.gz
RUN rm mesos-*.tar.gz
RUN mv mesos-* mesos
RUN mkdir build
RUN ./configure
RUN make
RUN make install

RUN ldconfig

This just builds and installs Mesos.

The leader Dockerfile:

FROM mesos_base
ADD docker/assets/start_leader start_leader
ENTRYPOINT ["./start_leader"]

It exposes the Mesos leader port and copies over a start_leader script, which contains:


# get "external" docker IP
LEADER_IP=$(ifconfig eth0 | grep 'inet addr:' | cut -d: -f2 | awk '{print $1}')
mesos-master --registry=in_memory --ip=${LEADER_IP} --zk=zk://${ZOOKEEPER}/mesos

All this does is tell the leader to use its "external" IP, which is necessary so that the Mesos followers and the client can properly communicate with it.

It also requires a ZOOKEEPER env variable to be set; it is specified when the Docker container is run (see below).

The follower Dockerfile:

FROM mesos_base

ADD docker/assets/start_follower start_follower


# permissions fix

# use python3 for pyspark
RUN apt-get install python3
ENV PYSPARK_PYTHON /usr/bin/python3

ENTRYPOINT ["./start_follower"]

There is a bit more going on here. The Mesos follower port is exposed and a few env variables are set. The MESOS_SWITCH_USER variable is a fix for a permissions issue, and the PYSPARK_PYTHON lets Spark know that we will use Python 3.

Like the leader image, there is a start_follower script here, which is simple:

mesos-slave --master=zk://${ZOOKEEPER}/mesos

Again, it uses a ZOOKEEPER env variable which is specified when the container is run.

Building the images

Finally, we can build all the images:

sudo docker build -f Dockerfile.mesos -t mesos_base .
sudo docker build -f Dockerfile.follower -t mesos_follower .
sudo docker build -f Dockerfile.leader -t mesos_leader .
sudo docker build -f Dockerfile.zookeeper -t mesos_zookeeper .
sudo docker build -f Dockerfile.hadoop -t hadoop .

Running the cluster

With all the images built, we can start the necessary Docker containers.

First, start a Zookeeper container:

sudo docker run --name mesos_zookeeper -itP mesos_zookeeper

When it's running, make a note of its IP:

ZOOKEEPER_IP=$(sudo docker inspect --format '{{.NetworkSettings.IPAddress }}' $(sudo docker ps -aq --filter=name=mesos_zookeeper))

Then, start the Hadoop container:

sudo docker run --name hadoop -itP hadoop

Note that our container name here should not have underscores in it, because Java can't handle hostnames with underscores.

Then, start a Mesos leader container:

sudo docker run -e ZOOKEEPER=${ZOOKEEPER_IP}:2181 --name mesos_leader -itP mesos_leader

Note that we set the ZOOKEEPER env variable here.

Finally, start a Mesos follower container in the same fashion:

sudo docker run -e ZOOKEEPER=${ZOOKEEPER_IP}:2181 --name mesos_follower -itP mesos_follower

Using the cluster

With the client setup and the cluster containers running, we can start using PySpark from the client machine.

We'll do the classic word count example to demonstrate the process.

First, open a shell in the Hadoop container:

sudo docker exec -it hadoop bash

From this container, grab a text file to work with and put it in the HDFS so the Mesos followers can access it:

hadoop fs -put pg4300.txt /sample.txt

Now, back in the client machine, we can put together a Python script to count the words in this file.

First, we need to know the Zookeeper host, so PySpark knows where to find the cluster, and the Hadoop IP, so PySpark knows where to grab the file from. We'll pass them in as command-line arguments and grab them using the sys library:

import sys
import pyspark

zookeeper = sys.argv[1]
hadoop_ip = sys.argv[2]

Then we can specify where to find the text:

src = 'hdfs://{}:8020/sample.txt'.format(hadoop_ip)

And configure PySpark:

conf = pyspark.SparkConf()

One important configuration option is spark.executor.uri, which tells Mesos followers where they can get the Spark binary to properly execute the tasks. This must be a prebuilt Spark archive, i.e. a Spark binary package. You can build it and host it yourself if you like.

conf.set('spark.executor.uri', '')

Then we can create the SparkContext with our config and define the task:

sc = pyspark.SparkContext(conf=conf)

lines = sc.textFile(src)
words = lines.flatMap(lambda x: x.split(' '))
word_count = ( x: (x, 1)).reduceByKey(lambda x, y: x+y))

Save this file as

There are a couple gotchas when running this script.

We cannot run it with a simple python If we do so, then PySpark will use the client's local IP, e.g. something like We want PySpark to use the client's Docker IP so that it can properly communicate with the other Docker containers, and specify this as an env variable called LIBPROCESS_IP:

export LIBPROCESS_IP=$(ifconfig docker0 | grep 'inet addr:' | cut -d: -f2 | awk '{print $1}')

Then, we must also specify the proper Python version for the client's Spark install:

export PYSPARK_PYTHON=/usr/bin/python3

Because we're also passing in the Zookeeper connection string and the Hadoop IP, let's get those too:

HADOOP_IP=$(sudo docker inspect --format '{{.NetworkSettings.IPAddress }}' $(sudo docker ps -aq --filter=name=hadoop))

And now we can run the script:


Multi-node/high-availability setup

So far we only have one follower, but to better emulate a multi-node setup, we want many followers. This is easy to do, just spin up more follower Docker containers with the proper ZOOKEEPER variable:

sudo docker run -e ZOOKEEPER=${ZOOKEEPER_IP}:2181 --name mesos_follower1 -itP mesos_follower
sudo docker run -e ZOOKEEPER=${ZOOKEEPER_IP}:2181 --name mesos_follower2 -itP mesos_follower
sudo docker run -e ZOOKEEPER=${ZOOKEEPER_IP}:2181 --name mesos_follower3 -itP mesos_follower
# etc

For a high-availability setup, we can also create many leaders in a similar way:

sudo docker run -e ZOOKEEPER=${ZOOKEEPER_IP}:2181 --name mesos_leader1 -itP mesos_leader
sudo docker run -e ZOOKEEPER=${ZOOKEEPER_IP}:2181 --name mesos_leader2 -itP mesos_leader
sudo docker run -e ZOOKEEPER=${ZOOKEEPER_IP}:2181 --name mesos_leader3 -itP mesos_leader
# etc

These leaders will all register with Zookeeper and Zookeeper will elect one to be the "active" leader. The followers will coordinate with Zookeeper to figure out which leader they should be talking to. If one leader goes down, Zookeeper will elect a new active leader in its place.

We can even have multiple Zookeeper containers, but I haven't yet tried it out.


This repo has all of the files mentioned with a script that makes it easy to spin up this entire setup.


Automatically identifying voicemails

Back in 2015, prosecutor Alberto Nisman was found dead under suspicious circumstances, just as he was about to bring a complaint accusing the Argentinian President Fernández over interfering with investigations into the AMIA bombing that took place in 1994 (This Guardian piece provides some good background).

There were some 40,000 phone calls related to the case that La Nación was interested in exploring further. Naturally, that is quite a big number and it's hard to gather the resources to comb through that many hours of audio.

La Nación crowdsourced the labeling of about 20,000 of these calls into those that were interesting and those that were not (e.g. voicemails or bits of idle chatter). For this process they used CrowData, a platform built by Manuel Aristarán and Gabriela Rodriguez, two former Knight-Mozilla Fellows at La Nación. This left about 20,000 unlabeled calls.

While Juan and I were in Buenos Aires for the Buenos Aires Media Party and our OpenNews fellows gathering, we took a shot at automatically labeling these calls.

Data preprocessing

The original data we had was in the form of mp3s and png images produced from the mp3s. wav files are easier to work with so we used ffmpeg to convert the mp3s. With wav files, it is just a matter of using scipy to load them as numpy arrays.

For instance:

import import wavfile

sample_rate, data ='/path/to/file.wav')
# [15,2,5,6,170,162,551,8487,1247,15827,...]

In the end however, we used librosa, which normalizes the amplitudes and computes a sample rate for the wav file, making the data easier to work with.

import librosa

data, sr = librosa.load('/path/to/file.wav', sr=None)
# [0.1,0.3,0.46,0.89,...]

These arrays can be very large depending on the audio file's sample rate, and quite noisy too, especially when trying to identify silences. There may be short spikes in amplitude in an otherwise "silent" section, and in general, there is no true silence. Most silences are just low amplitude but not exactly 0.

In the example below you can see that what a person might consider silence has a few bits of very quiet sound scattered throughout.

Silences aren't exactly silent
Silences aren't exactly silent

There is also "noise" in the non-silent parts; that is, the signal can fluctuate quite suddenly, which can make analysis unwieldy.

To address these concerns, our preprocessing mostly consisted of:

  • Reducing the sample rate a bit so the arrays weren't so large, since the features we looked at don't need the precision of a higher sample rate.
  • Applying a smoothing function to deal with intermittent spikes in amplitude.
  • Zeroing out any amplitudes below 0.015 (i.e. we considered any amplitude under 0.015 to be silence).

Since we had about 20,000 labeled examples to process, we used joblib to parallelize the process, which improved speeds considerably.

The preprocessing scripts are available here.

Feature engineering

Typically, the main challenge in a machine learning problem is that of feature engineering - how do we take the raw audio data and represent it in a way that best suits the learning algorithm?

Audio files can be easily visualized, so our approach benefited from our own visual systems - we looked at a few examples from the voicemail and non-voicemail groups to see if any patterns jumped out immediately. Perhaps the clearest two patterns were the rings and the silence:

  • A voicemail file will also have a greater proportion of silence than sound. For this, we looked at the images generated from the audio and calculated the percentage of white pixels (representing silence) in the image.
  • A voicemail file will have several distinct rings, and the end of the file comes soon after the last ring. The intuition here is that no one picks up during a voicemail - hence many rings - and no one stays on the line much longer after the phone stops ringing. So we consider both the number of rings and the time from the last ring to the end of the file.

Ring analysis

Identifying the rings is a challenge in itself - we developed a few heuristics which seem to work fairly well. You can see our complete analysis here, but the general idea is that we:

  • Identify non-silent parts, separated by silences
  • Check the length of the silence that precedes the non-silent part, if it is too short or too long, it is not a ring
  • Check the difference between maximum and minimum amplitudes of the non-silent part; it should be small if it's a ring

The example below shows the original audio waveform in green and the smoothed one in red. You can see that the rings are preceded by silences of a roughly equivalent length and that they look more like plateaus (flat-ish on the top). Another way of saying this is that rings have low variance in their amplitude. In contrast, the non-ring signal towards the end has much sharper peaks and vary a lot more in amplitude.

Automatic ring detection
Automatic ring detection

Other features

We also considered a few other features:

  • Variance: voicemails have greater variance, since there is lots of silence punctuated by high-amplitude rings and not much in between.
  • Length: voicemails tend to be shorter since people hang up after a few rings.
  • Max amplitude: under the assumption that human speech is louder than the rings
  • Mean silence length: under the assumption that when people talk, there are only short silences (if any)

However, after some experimentation, the proportion of silence and the ring-based features performed the best.

Selecting, training, and evaluating the model

With the features in hand, the rest of the task is straightforward: it is a simple binary classification problem. An audio file is either a voicemail or not. We had several models to choose from; we tried logistic regression, random forest, and support vector machines since they are well-worn approaches that tend to perform well.

We first scaled the training data and then the testing data in the same way and computed cross validation scores for each model:

    roc_auc: 0.96 (+/- 0.02)
    average_precision: 0.94 (+/- 0.03)
    recall: 0.90 (+/- 0.04)
    f1: 0.88 (+/- 0.03)
    roc_auc: 0.96 (+/- 0.02)
    average_precision: 0.95 (+/- 0.02)
    recall: 0.89 (+/- 0.04)
    f1: 0.90 (+/- 0.03)
    roc_auc: 0.96 (+/- 0.02)
    average_precision: 0.94 (+/- 0.03)
    recall: 0.91 (+/- 0.04)
    f1: 0.90 (+/- 0.02)

We were curious what features were good predictors, so we looked at the relative importances of the features for both logistic regression:

[('length', -3.814302896584862),
 ('last_ring_to_end', 0.0056240364270560934),
('percent_silence', -0.67390678402142834),
('ring_count', 0.48483923341906693),
('white_proportion', 2.3131580570928114)]

And for the random forest classifier:

[('length', 0.30593363755717351),
 ('last_ring_to_end', 0.33353202776482688),
 ('percent_silence', 0.15206534339705702),
 ('ring_count', 0.0086084243372190443),
 ('white_proportion', 0.19986056694372359)]

Each of the models perform about the same, so we combined them all with a bagging approach (though in the notebook above we forgot to train each model on a different training subset, which may have helped performance), where we selected the label with the majority vote from the models.


We tried two variations on classifying the audio files, differing in where we set the probability cutoff for classifying a file as uninteresting or not.

in the balanced classification, we set the probability threshold to 0.5, so any audio file that has ≥ 0.5 of being uninteresting is classified as uninteresting. This approach labeled 8,069 files as discardable.
in the unbalanced classification, we set the threshold to the much stricter 0.9, so an audio file must have ≥ 0.9 chance of being uninteresting to be discarded. This approach labeled 5,785 files as discardable.

You can see the Jupyter notebook where we conducted the classification here.


We have also created a validation Jupyter notebook where we can cherry pick random results from our classified test set and verify the correctness ourselves by listening to the audio file and viewing its image.

The validation code is available here.


Even though using machine learning to classify audio is noisy and far from perfect, it can be useful making a problem more manageable. In our case, our solution narrowed the pool of audio files to only those that seem to be more interesting, reducing the time and resources needed to find the good stuff. We could always double check some of the discarded ones if there’s time to do that.


Broca's Area
Broca's Area

At this year's OpenNews Code Convening, Alex Spangher of the New York Times and I worked on broca, which is a Python library for rapidly experimenting with new NLP approaches.

Conventional NLP methods - bag-of-words vector space representations of documents, for example - generally work well, but sometimes not well enough, or worse yet, not well at all. At that point, you might want to try out a lot of different methods that aren't available in popular NLP libraries.

Prior to the Code Convening, broca was little more than a hodgepodge of algorithms I'd implemented for various projects. During the Convening, we restructured the library, added some examples and tests, and implemented in the key piece of broca: pipelines.


The core of broca is organized around pipes, which take some input and produce some output, which are then chained into pipelines.

Pipes represent different stages of an NLP process - for instance, your first stage may involve preprocessing or cleaning up the document, the next may be vectorizing it, and so on.

In broca, this would look like:

from broca.pipeline import Pipeline
from broca.preprocess import Cleaner
from broca.vectorize import BoW

docs = [
    # ...
    # some string documents
    # ...

pipeline = Pipeline(

vectors = pipeline(docs)

Since a key part of broca is rapid prototyping, it makes it very easy to simultaneously try different pipelines which may vary in only a few components:

from broca.vectorize import DCS

pipeline = Pipeline(
        [BoW(), DCS()]

This would produce a multi-pipeline consisting of two pipelines: one which vectorizes using BoW, the other using DCS.

Multi-pipelines often have shared components. In the example above, Cleaner() is in both pipelines. To avoid redundant processing, a key part of broca's pipelines is that the output for each pipe is "frozen" to disk.

These frozen outputs are identified by a hash derived from the input data and other factors. If frozen output exists for a pipe and its input, that frozen output is "defrosted" and returned, saving unnecessary processing time.

broca's Cryo
broca's Cryo

This way, you can tweak different components of the pipeline without worrying about needing to re-compute a lot of data. Only the parts that have changed will be re-computed.

Included pipes

broca includes a few pipes:

  • broca.tokenize includes various tokenization methods, using lemmas and a few different keyword extractors.
  • broca.vectorize includes a traditional bag-of-words vectorizer, an implementation of "dismabiguated core semantics", and Doc2Vec.
  • broca.preprocess includes common preprocessors - cleaning punctuation, HTML, and a few others.

Other tools

Not everything in broca is a pipe. Also included are:

  • broca.similarity includes similarity methods for terms and documents.
  • broca.distance includes string distance methods (this may be renamed later).
  • broca.knowledge includes some tools for dealing with external knowledge sources (e.g. other corpora or Wikipedia).

Though at some point these may also become pipes.

Give us your pipes!

We made it really easy to implement your own pipes. Just inherit from the Pipe class, specify the class's input and output types, and implement the __call__ method (that's what's called for each pipe).

For example:

from broca.pipeline import Pipe

class MyPipe(Pipe):
    input =
    output = Pipe.type.vecs

    def __init__(self, some_param):
        self.some_param = some_param

    def __call__(self, docs):
        # do something with docs to get vectors
        vecs = make_vecs_func(docs, self.some_param)
        return vecs

We hope that others will implement their own pipes and submit them as pull requests - it would be great if broca becomes a repository of sundry NLP methods which makes it super easy to quickly try a battery of techniques on a problem.

broca is available on GitHub and also via pip:

pip install broca

Fellowship Status Update


I've long been fascinated with the potential for technology to liberate people from things people would rather not be doing. This relationship between human and machine almost invariably manifests in the context of production processes - making this procedure a bit more efficient here, streamlining this process a bit there.

But what about social processes? A hallmark of the internet today is the utter ugliness that's possible of people; a seemingly inescapable blemish on the grand visions of the internet's potential for social transformation. And the internet's opening of the floodgates has had the expected effect of information everywhere, though perhaps in volumes greater than anyone anticipated.

Here we are, trying to engage little tyrants at considerable emotional expense. Here we are, futilely chipping away at the info-deluge we're suspended in. Here we are, both these things gradually chipping away at us. Things people would rather not be doing.


Prior to my fellowship, these kinds of inquiries had to be relegated to off-hours skunkworks. The fellowship has given me the rare privilege of autonomy, both financial and temporal, and the resources, especially of the human kind, with which I can actually explore these questions as my job.

With the Coral Project, I'm researching what makes digital communities tick, surveying the problems with which they grapple, and learning about how different groups are approaching them - from video games to journalism to social networks both in the mainstream and the fringes (you can read my notes here). Soon we'll be building software to address these issues.

For my own projects I'm working on automatic summarization of comments sections, a service that keeps up with news when you can't, a reputation system for new social network, and all the auxiliary tools these kinds of projects tend to spawn, laying the groundwork for work I hope to continue long after the fellowship. I've been toying with the idea of simulating social networks to provide testing grounds for new automated community management tools. The best part is that it's up to me whether or not I pursue it.

A huge part of the fellowship is learning, which is something I love but have had to carve out my own time for. Here, it's part of the package. I've had ample opportunity to really dig into the sprawling landscape of machine learning and AI (my in-progress notes are here), something I've long wanted to do but never had the space for.

The applications for the 2016 fellowship are open, and I encourage you to apply. Rub shoulders with fantastic and talented folks from a variety of backgrounds. Pursue the questions that conventional employment prohibits you from. Explore topics and skills you've never had time for. It's really what you make of it. At the very least, it's a unique opportunity to be deliberate about where your work takes you.

The halfway mark of my OpenNews fellowship has just about passed. I knew from the start the time would pass by quickly, but I hadn't considered how much could happen in this short a time. There are only about 5 months left - the fellowship does end, but, fortunately, the work it inaugurates doesn't have to.

Geiger (Intro/Update)

A couple months ago I thought it would be interesting to see if a summary could be generated for a comment section. As a comment section grows, the comments become more repetitive as more people pile into make the same point. It also seems that some natural clustering forms as some commenters focus on particular aspects of an article or topic.

When there are hundreds to thousands of comments, there is little to be gained by reading all of them. However, it may be useful to quantify how much support certain opinions have, or what is most salient about a particular topic. What if there we had some automated means of presenting us such insight? For example, for an article about a new coal industry regulation: 27 comments are focused on how this regulation affects jobs, 39 are arguing about the environmental impacts, 6 are mentioning the meaning of this regulation in an international context, etc.

Having such insight can serve a number of purposes:

  • Provide a quick understanding of the salient points for readers of an article
  • Direct focus for future articles on the topic
  • Give a quick view into how people are responding to an article
  • Provide fodder for follow-up pieces on how people are responding
  • Surface entry points for other readers into the conversation

Geiger is still very much a work in progress and has led to a lot of experimentation, some of which worked ok, some of which didn't work at all, but so far nothing has worked as well as I'd like.

Below is a screenshot from an early prototype of Geiger which allowed me to try a battery of common techniques (TF-IDF bag of words with DBSCAN, K-Means, HAC, and LDA) and compare their results on any New York Times article with comments.

An early Geiger prototype
An early Geiger prototype

None of those led to particularly promising results, but a few alternatives were available.

Aspect summarization

This problem of clustering-to-summarize comments is similar to aspect summarization, which is more closely associated with ratings and reviews. For instance, you may have seen how Yelp's business pages have a few sentences selected at the top, with some term (the "aspect") highlighted, and then the number of reviewers that mentioned this term. That's aspect summarization - the aggregate reviews are being summarized by highlighting aspects which are mentioned the most.

Yelp's aspect summarization
Yelp's aspect summarization

Sometimes aspect summarization includes an additional layer of sentiment analysis, so that instead of just quantifying the number of people talking about an aspect, whether they are talking positively or negatively can also be surfaced (Yelp isn't doing this, however).

The process of aspect summarization can be broken down into three steps:

  1. Identify aspects
  2. Group documents by aspect
  3. Rank aspect groups

To identify aspects I used a few keyword extraction approaches (PoS tagging for noun phrases, named entity recognition, and other methods like Rapid Automatic Keyword Extraction) and then learned phrases by looking at keyword co-occurrences. If two keywords are adjacent (or separated by only a conjunction or hyphen) in at least 80% in the documents where they are present, we consider it a key phrase.

This simple co-occurrence approach is surprisingly effective. Here are some phrases learned on a set of comments for the coal industry regulation article:

'carbon tax', 'green energy', 'sun and wind', 'clean coal', 'air and water', 'high level', 'slow climate', 'middle class', 'signature environmental', 'mitch mcconnell', 'poor people', 'coal industry', 'true cost', 'clerical error', 'coal miner', 'representative democracy', 'co2 emission', 'power source', 'clean air', 'future generation', 'blah blah', 'ice age', 'planet earth', 'climate change', 'energy industry', 'critical thinking', 'particulate matter', 'coal mining', 'corporate interest', 'solar and wind', 'air act', 'acid rain', 'carbon dioxide', 'heavy metal', 'obama administration', 'monied interest', 'greenhouse gas', 'human specie', 'president obama', 'long term', 'political decision', 'big coal', 'coal and natural', 'al gore', 'bottom line', 'power generation', 'wind and solar', 'nuclear plant', 'global warming', 'human race', 'supreme court', 'environmental achievement', 'renewable source', 'coal ash', 'legal battle', 'united state', 'wind power', 'epa regulation', 'economic cost', 'federal government', 'state government', 'natural gas', 'west virginia', 'nuclear power', 'radioactive waste', 'battle begin', 'coal fire', 'energy source', 'common good', 'renewable energy', 'coal burning', 'nuclear energy', 'big tobacco', 'carbon footprint', 'red state', 'sea ice', 'peabody coal', 'tobacco industry', 'american citizen', 'fossil fuel', 'fuel industry', 'climate scientist', 'carbon credit', 'power plant', 'republican president', 'electricity cost'

Some additional processing steps were performed, such as removing keywords that were totally subsumed by key phrases; that is, keywords which only ever appear as part of a key phrase. Keywords were also stemmed and merged, e.g. "polluter", "pollute", "pollutant", "pollution" are grouped as a single aspect.

Grouping documents by aspects is straightforward (just look at overlaps). For this task I treated individual sentences as the documents, much like Yelp does.

Ranking them is a bit trickier. I used a combination of token length (assuming that phrases are more interesting than single keywords), support (number of sentences which mention the aspect), and IDF weighting of the aspect. The latter is useful because, for instance, we expect many comments will mention the "coal industry" if the article is about the coal industry, rendering it uninformative.

Although you get a bit of insight into what commenters are discussing, the results of this approach aren't very interesting. We don't really get any summary of what people are saying about an aspect. This is problematic when commenters are talking about an aspect in different ways. For instance, many commenters are talking about "climate change", but some ask whether or not the proposed regulation would be effective in mitigating it, whereas others debate whether or not climate change is a legitimate concern.

Finally, one problem here, which is consistent across all methods, is that this method is ignorant of synonymy - it cannot recognize when two words which look different mean essentially the same thing. For instance, colloquially people use "climate change" and "global warming" interchangeably, but here they are treated as two different aspects. This is a consequence of text similarity approaches which rely on matching the surface form of words - that is, which only look at exact term overlap.

This is especially challenging when dealing with short text documents, which I explain in greater length here.

Word2Vec word embeddings

There has been a lot of excitement around neural networks, and rightly so - their ability to learn representations is very useful. Word2Vec is capable of learning vector representations of words ("word embeddings") which allow us to capture some degree of semantic quality in vector space. For example, we could say that two words are semantically similar if their word embeddings are close to each other in vector space.

I loaded up Google's pre-trained Word2Vec model (trained on 100 billion words from a Google News dataset) and tested it out a bit. It seemed promising:

w2v.similarity('climate_change', 'global_warming')
>>> 0.88960381786226284

I made some attempts at approaches which leaned on this Word2Vec similarity of terms rather than their exact overlap - when comparing two documents A and B, each term from A is matched to is maximally-similar term in B and vice versa. Then the documents' similarity score is computed from these pairs' similarity values, weighted by their average IDF.

A problem with using Word2Vec word embeddings is that they are not really meant to quantify synonymy. Words that have embeddings close together do not necessarily mean similar things, all that it means is that they are exchangeable in some way. For example:

w2v.similarity('good', 'bad')
>>> 0.71900512146338569

The terms "good" and "bad" are definitely not synonyms, but they serve the same function (indicating quality or moral judgement) and so we expect to find them in similar contexts.

Because of this, Word2Vec ends up introducing more noise on occasion.

Measuring salience

Another angle I spent some time on was coming up with some better way of computing term "salience" - how interesting a term is. IDF is a good place to start, since a reasonable definition of a salient term is one that doesn't appear in every document, nor does it only appear in one or two documents. We want something more in the middle, since that indicates a term that commenters are congregating around.

Thus middle IDF values should be weighted higher than those closer to 0 or 1 (assuming these values are normalized to $[0,1]$). To capture this, I put terms' IDF values through a Gaussian function with its peak at $x=0.5$ and called the resulting value the term's "salience". Then, using the Word2Vec approach above, maximal pairs' similarity scores are weighted by their average salience instead of their average IDF.

The results of this technique look less noisy than before, but there is still ample room for improvement.


Working through a variety of approaches has helped clarify what the main difficulties of the problem are:

  • Short texts lack a lot of helpful context
  • Recognizing synonymy is tricky
  • Noise - some terms aren't very interesting given the article or what other people are saying

What's next

More recently I have been trying a new clustering approach (hscluster) and exploring ways of better measuring short text similarity. I'm also going to take a deeper look into topic modeling methods, which I don't have a good grasp on yet but seem promising.