As part of my OpenNews fellowship I have been catching up on the theory behind machine learning and artificial intelligence. My fellowship is ending soon, so today I'm releasing the notes from the past year study to bookend my experience.

To get a good grasp of a topic, I usually went through one or two courses on it (and a lot of supplemental explanations and overviews). Some courses were fantastic, some were real slogs to get through - teaching styles were significantly different.

The best courses has a few things in common:

1. introduce intuition before formalism
2. provide convincing motivation before each concept
3. aren't shy about providing concrete examples
4. provide an explicit structure of how different concepts are hierarchically related (e.g. "these techniques belong to the same family of techniques, because they all conceptualize things this way")
5. provide clear roadmaps through each unit, almost in a narrative-like way

The worst courses, on the other hand:

1. lean too much (or exclusively) on formalism
2. seem allergic to concrete examples
3. sometimes take inexplicit/unannounced detours into separate topics
4. just march through concepts without explaining how they are connected or categorized

Intuition before formalism

In machine learning, how you represent a problem or data is hugely important for successfully teaching the machine what you want it to learn. The same is, of course, for people. How we encode concepts into language, images, or other sensory modalities is critical to effective learning. Likewise, how we represent problems - that is, how we frame or describe them - can make all the difference in how easily we can solve them. It can even change whether or not we can solve them.

One of the courses I enjoyed the most was Patrick Winston's MIT 6.034 (Fall 2010): Artificial Intelligence. He structures the entire course around this idea of the importance of the right representation and for people, the right representation is often stories (he has a short series of videos, How to Speak, where he explains his lecturing methods - worth checking out).

Many technical fields benefit from presenting difficult concepts as stories and analogies - for example, many concepts in cryptography are taught with "Alice and Bob" stories, and introductory economics courses involve a lot of hypothetical beer and pizza. For some reason concepts are easier to grasp if we anthropomorphize them or otherwise put them into "human" terms, anointing them with volition and motivations (though not without its shortcomings). Presenting concepts as stories is especially useful because the concept's motivation is often clearly presented as part of the story itself, and they function as concrete examples.

Introducing concepts as stories also allows us to leverage existing intuitions and experiences, in some sense "bootstrapping" our understanding of these concepts. George Polya, in his Mathematics and Plausible Reasoning, argues that mathematical breakthroughs begin outside of formalism, in the fuzzy space of intuition and what he calls "plausible reasoning", and I would say that understanding in general also follows this same process.

On the other hand, when concepts are introduced as a dense mathematical formalism, it might make internal sense (e.g. "$a$ has this property so naturally it follows that $b$ has this property"), but it makes no sense in the grander scheme of life and the problems I care about. Then I start to lose interest.

Concept hierarchies

Leaning on existing intuition is much easier when a clear hierarchy of concepts is presented. Many classes felt like, to paraphrase, "one fucking concept after another", and it was hard to know where understanding from another subject or idea could be applied. However, once it's made clear, I can build of existing understanding. For instance: if I'm told that this method and this method belongs to the family of Bayesian learning methods, then I know that I can apply Bayesian model selection techniques and can build off of my existing understandings around Bayesian statistics. Then I know to look at the method or problem in Bayesian terms, i.e. with the assumptions made in that framework, and suddenly things start to make sense.

Some learning tips

Aside from patience and persistence (which are really important), two techniques I found helpful for better grasping the material were explaining things to other and reading many sources on the same concept.

Explain things to other people

When trying to explain something, you often reveal your own conceptual leaps and gaps and assumptions and can then work on filling them in. Sometimes it takes a layperson to ask "But why is that?" for you to realize that you don't know either.

Most ideas are intuitive if they are explained well. They do not necessarily need to be convoluted and dense (though they are often taught this way!). If you can explain a concept well, you understand it well.

People conceptualize ideas in different ways, relating them to their own knowledge and experiences. A concept really clicks if someone with a similar background to yours explains it because the connections will be more intuitive.

The same concepts are often present in many different fields. They may be referred to by different names or just have different preferred explanations. Seeing a concept in different contexts and introduced in different ways can provide a really rich understanding of it. If you can explain a concept in many different terms, you probably have a strong multi-perspective understanding of it.

coral metrics sketch

As part of the Coral Project, we're trying to come up with some interesting and useful metrics about community members and discussion on news sites.

It's an interesting exercise to develop metrics which embody an organization's principles. For instance - perhaps we see our content as the catalyst for conversations, so we'd measure an article's success by how much discussion it generates.

Generally, there are two groups of metrics that I have been focusing on:

• Asset-level metrics, computed for individual articles or whatever else may be commented on
• User-level metrics, computed for individual users

For the past couple of weeks I've been sketching out a few ideas for these metrics:

• For assets, the principles that these metrics aspire to capture are around quantity and diversity of discussion.
• For users, I look at organizational approval, community approval, how much discussion this user tends to generate, and how likely they are to be moderated.

Here I'll walk through my thought process for these initial ideas.

Asset-level metrics

For assets, I wanted to value not only the amount of discussion generated but also the diversity discussions. A good discussion is one in which there's a lot of high-quality exchange (something else to be measured, but not captured in this first iteration) from many different people.

There are two scores to start:

• a discussion score, which quantifies how much discussion an asset generated. This looks at how much people are talking to each other as opposed to just counting up the number of comments. For instance, a comments section in which all comments are top-level should not have a high discussion score. A comments section in which there are some really deep back-and-forths should have a higher discussion score.
• a diversity score, which quantifies how many different people are involved in the discussions. Again, we don't want to look at diversity in the comments section as a whole because we are looking for diversity within discussions, i.e. within threads.

The current sketch for computing the discussion score is via two values:

• maximum thread width: what is the highest number of replies for a comment?

These are pretty rough approximations of "how much discussion" there is. The idea is that for sites which only allow one level of replies, a lot of replies to a comment can signal a discussion, and that a very deep thread signals the same for sites which allow more nesting.

The discussion score of a top-level thread is the product of these two intermediary metrics:

$$\text{discussion score}_{\text{thread}} = \max(\text{thread}_{\text{depth}}) \max(\text{thread}_{\text{width}})$$

The discussion score for the entire asset is the value that answers this question: if a new thread were to start in this asset, what discussion score would it have?

The idea is that if a section is generating a lot of discussion, a new thread would likely also involve a lot of discussion.

The nice thing about this approach (which is similar to the one used throughout all these sketches) is that we can capture uncertainty. When a new article is posted, we have no idea how good of a discussion a new thread might be. When we have one or two threads - maybe one is long and one is short - we're still not too sure, so we still have a fairly conservative score. But as more and more people comment, we begin to narrow down on the "true" score for the article.

More concretely (skip ahead to be spared of the gory details), we assume that this discussion score is drawn from a Poisson distribution. This makes things a bit easier to model because we can use the gamma distribution as a conjugate prior.

By default, the gamma prior is parameterized with $k=1, \theta=2$ since it is a fairly conservative estimate to start. That is, we begin with the assumption that any new thread is unlikely to generate a lot of discussion, so it will take a lot of discussion to really convince us otherwise.

Since this gamma-Poisson model will be reused elsewhere, it is defined as its own function:

def gamma_poission_model(X, n, k, theta, quantile):
k = np.sum(X) + k
t = theta/(theta*n + 1)
return stats.gamma.ppf(quantile, k, scale=t)


Since the gamma is a conjugate prior here, the posterior is also a gamma distribution with easily-computed parameters based on the observed data (i.e. the "actual" top-level threads in the discussion).

We need an actual value to work with, however, so we need some point estimate of the expected discussion score. However, we don't want to go with the mean since that may be too optimistic a value, especially if we only have a couple top-level threads to look at. So instead, we look at the lower-bound of the 90% credible interval (the 0.05 quantile) to make a more conservative estimate.

So the final function for computing an asset's discussion score is:

def asset_discussion_score(threads, k=1, theta=2):
n = len(X)

k = np.sum(X) + k
t = theta/(theta*n + 1)

return {'discussion_score': gamma_poission_model(X, n, k, theta, 0.05)}


A similar approach is used for an asset's diversity score. Here we ask the question: if a new comment is posted, how likely is it to be a posted by someone new to the discussion?

We can model this with a beta-binomial model; again, the beta distribution is a conjugate prior for the binomial distribution, so we can compute the posterior's parameters very easily:

def beta_binomial_model(y, n, alpha, beta, quantile):
alpha_ = y + alpha
beta_ = n - y + beta
return stats.beta.ppf(quantile, alpha_, beta_)


Again we start with conservative parameters for the prior, $\alpha=2, \beta=2$, and then compute using threads as evidence:

def asset_diversity_score(threads, alpha=2, beta=2):
X = set()
n = 0
X = X | users
y = len(X)

return {'diversity_score': beta_binomial_model(y, n, alpha, beta, 0.05)}


Then averages for these scores are computed across the entire sample of assets in order to give some context as to what good and bad scores are.

User-level metrics

User-level metrics are computed in a similar fashion. For each user, four metrics are computed:

• a community score, which quantifies how much the community approves of them. This is computed by trying to predict the number of likes a new post by this user will get.
• an organization score, which quantifies how much the organization approves of them. This is the probability that a post by this user will get "editor's pick" or some equivalent (in the case of Reddit, "gilded", which isn't "organizational" but holds a similar revered status).
• a discussion score, which quantifies how much discussion this user tends to generate. This answers the question: if this user starts a new thread, how many replies do we expect it to have?
• a moderation probability, which is the probability that a post by this user will be moderated.

The community score and discussion score are both modeled as gamma-Poission models using the same function as above. The organization score and moderation probability are both modeled as beta-binomial models using the same function as above.

Time for more refinement

These metrics are just a few starting points to shape into more sophisticated and nuanced scoring systems. There are some desirable properties missing, and of course, every organization has different principles and values, and so the ideas presented here are not one-size-fits-all, by any means. The challenge is to create some more general framework that allows people to easily define these metrics according to what they value.

Automatically identifying voicemails

Back in 2015, prosecutor Alberto Nisman was found dead under suspicious circumstances, just as he was about to bring a complaint accusing the Argentinian President Fernández over interfering with investigations into the AMIA bombing that took place in 1994 (This Guardian piece provides some good background).

There were some 40,000 phone calls related to the case that La Nación was interested in exploring further. Naturally, that is quite a big number and it's hard to gather the resources to comb through that many hours of audio.

La Nación crowdsourced the labeling of about 20,000 of these calls into those that were interesting and those that were not (e.g. voicemails or bits of idle chatter). For this process they used CrowData, a platform built by Manuel Aristarán and Gabriela Rodriguez, two former Knight-Mozilla Fellows at La Nación. This left about 20,000 unlabeled calls.

While Juan and I were in Buenos Aires for the Buenos Aires Media Party and our OpenNews fellows gathering, we took a shot at automatically labeling these calls.

Data preprocessing

The original data we had was in the form of mp3s and png images produced from the mp3s. wav files are easier to work with so we used ffmpeg to convert the mp3s. With wav files, it is just a matter of using scipy to load them as numpy arrays.

For instance:

import scipy.io import wavfile

print(data)
# [15,2,5,6,170,162,551,8487,1247,15827,...]


In the end however, we used librosa, which normalizes the amplitudes and computes a sample rate for the wav file, making the data easier to work with.

import librosa

print(data)
# [0.1,0.3,0.46,0.89,...]


These arrays can be very large depending on the audio file's sample rate, and quite noisy too, especially when trying to identify silences. There may be short spikes in amplitude in an otherwise "silent" section, and in general, there is no true silence. Most silences are just low amplitude but not exactly 0.

In the example below you can see that what a person might consider silence has a few bits of very quiet sound scattered throughout.

There is also "noise" in the non-silent parts; that is, the signal can fluctuate quite suddenly, which can make analysis unwieldy.

To address these concerns, our preprocessing mostly consisted of:

• Reducing the sample rate a bit so the arrays weren't so large, since the features we looked at don't need the precision of a higher sample rate.
• Applying a smoothing function to deal with intermittent spikes in amplitude.
• Zeroing out any amplitudes below 0.015 (i.e. we considered any amplitude under 0.015 to be silence).

Since we had about 20,000 labeled examples to process, we used joblib to parallelize the process, which improved speeds considerably.

Feature engineering

Typically, the main challenge in a machine learning problem is that of feature engineering - how do we take the raw audio data and represent it in a way that best suits the learning algorithm?

Audio files can be easily visualized, so our approach benefited from our own visual systems - we looked at a few examples from the voicemail and non-voicemail groups to see if any patterns jumped out immediately. Perhaps the clearest two patterns were the rings and the silence:

• A voicemail file will also have a greater proportion of silence than sound. For this, we looked at the images generated from the audio and calculated the percentage of white pixels (representing silence) in the image.
• A voicemail file will have several distinct rings, and the end of the file comes soon after the last ring. The intuition here is that no one picks up during a voicemail - hence many rings - and no one stays on the line much longer after the phone stops ringing. So we consider both the number of rings and the time from the last ring to the end of the file.

Ring analysis

Identifying the rings is a challenge in itself - we developed a few heuristics which seem to work fairly well. You can see our complete analysis here, but the general idea is that we:

• Identify non-silent parts, separated by silences
• Check the length of the silence that precedes the non-silent part, if it is too short or too long, it is not a ring
• Check the difference between maximum and minimum amplitudes of the non-silent part; it should be small if it's a ring

The example below shows the original audio waveform in green and the smoothed one in red. You can see that the rings are preceded by silences of a roughly equivalent length and that they look more like plateaus (flat-ish on the top). Another way of saying this is that rings have low variance in their amplitude. In contrast, the non-ring signal towards the end has much sharper peaks and vary a lot more in amplitude.

Other features

We also considered a few other features:

• Variance: voicemails have greater variance, since there is lots of silence punctuated by high-amplitude rings and not much in between.
• Length: voicemails tend to be shorter since people hang up after a few rings.
• Max amplitude: under the assumption that human speech is louder than the rings
• Mean silence length: under the assumption that when people talk, there are only short silences (if any)

However, after some experimentation, the proportion of silence and the ring-based features performed the best.

Selecting, training, and evaluating the model

With the features in hand, the rest of the task is straightforward: it is a simple binary classification problem. An audio file is either a voicemail or not. We had several models to choose from; we tried logistic regression, random forest, and support vector machines since they are well-worn approaches that tend to perform well.

We first scaled the training data and then the testing data in the same way and computed cross validation scores for each model:

LogisticRegression
roc_auc: 0.96 (+/- 0.02)
average_precision: 0.94 (+/- 0.03)
recall: 0.90 (+/- 0.04)
f1: 0.88 (+/- 0.03)
RandomForestClassifier
roc_auc: 0.96 (+/- 0.02)
average_precision: 0.95 (+/- 0.02)
recall: 0.89 (+/- 0.04)
f1: 0.90 (+/- 0.03)
SVC
roc_auc: 0.96 (+/- 0.02)
average_precision: 0.94 (+/- 0.03)
recall: 0.91 (+/- 0.04)
f1: 0.90 (+/- 0.02)


We were curious what features were good predictors, so we looked at the relative importances of the features for both logistic regression:

[('length', -3.814302896584862),
('last_ring_to_end', 0.0056240364270560934),
('percent_silence', -0.67390678402142834),
('ring_count', 0.48483923341906693),
('white_proportion', 2.3131580570928114)]


And for the random forest classifier:

[('length', 0.30593363755717351),
('last_ring_to_end', 0.33353202776482688),
('percent_silence', 0.15206534339705702),
('ring_count', 0.0086084243372190443),
('white_proportion', 0.19986056694372359)]


Each of the models perform about the same, so we combined them all with a bagging approach (though in the notebook above we forgot to train each model on a different training subset, which may have helped performance), where we selected the label with the majority vote from the models.

Classification

We tried two variations on classifying the audio files, differing in where we set the probability cutoff for classifying a file as uninteresting or not.

in the balanced classification, we set the probability threshold to 0.5, so any audio file that has ≥ 0.5 of being uninteresting is classified as uninteresting. This approach labeled 8,069 files as discardable.
in the unbalanced classification, we set the threshold to the much stricter 0.9, so an audio file must have ≥ 0.9 chance of being uninteresting to be discarded. This approach labeled 5,785 files as discardable.

Validation

We have also created a validation Jupyter notebook where we can cherry pick random results from our classified test set and verify the correctness ourselves by listening to the audio file and viewing its image.

The validation code is available here.

Summary

Even though using machine learning to classify audio is noisy and far from perfect, it can be useful making a problem more manageable. In our case, our solution narrowed the pool of audio files to only those that seem to be more interesting, reducing the time and resources needed to find the good stuff. We could always double check some of the discarded ones if there’s time to do that.

broca

At this year's OpenNews Code Convening, Alex Spangher of the New York Times and I worked on broca, which is a Python library for rapidly experimenting with new NLP approaches.

Conventional NLP methods - bag-of-words vector space representations of documents, for example - generally work well, but sometimes not well enough, or worse yet, not well at all. At that point, you might want to try out a lot of different methods that aren't available in popular NLP libraries.

Prior to the Code Convening, broca was little more than a hodgepodge of algorithms I'd implemented for various projects. During the Convening, we restructured the library, added some examples and tests, and implemented in the key piece of broca: pipelines.

Pipelines

The core of broca is organized around pipes, which take some input and produce some output, which are then chained into pipelines.

Pipes represent different stages of an NLP process - for instance, your first stage may involve preprocessing or cleaning up the document, the next may be vectorizing it, and so on.

In broca, this would look like:

from broca.pipeline import Pipeline
from broca.preprocess import Cleaner
from broca.vectorize import BoW

docs = [
# ...
# some string documents
# ...
]

pipeline = Pipeline(
Cleaner(),
BoW()
)

vectors = pipeline(docs)


Since a key part of broca is rapid prototyping, it makes it very easy to simultaneously try different pipelines which may vary in only a few components:

from broca.vectorize import DCS

pipeline = Pipeline(
Cleaner(),
[BoW(), DCS()]
)


This would produce a multi-pipeline consisting of two pipelines: one which vectorizes using BoW, the other using DCS.

Multi-pipelines often have shared components. In the example above, Cleaner() is in both pipelines. To avoid redundant processing, a key part of broca's pipelines is that the output for each pipe is "frozen" to disk.

These frozen outputs are identified by a hash derived from the input data and other factors. If frozen output exists for a pipe and its input, that frozen output is "defrosted" and returned, saving unnecessary processing time.

This way, you can tweak different components of the pipeline without worrying about needing to re-compute a lot of data. Only the parts that have changed will be re-computed.

Included pipes

broca includes a few pipes:

• broca.tokenize includes various tokenization methods, using lemmas and a few different keyword extractors.
• broca.vectorize includes a traditional bag-of-words vectorizer, an implementation of "dismabiguated core semantics", and Doc2Vec.
• broca.preprocess includes common preprocessors - cleaning punctuation, HTML, and a few others.

Other tools

Not everything in broca is a pipe. Also included are:

• broca.similarity includes similarity methods for terms and documents.
• broca.distance includes string distance methods (this may be renamed later).
• broca.knowledge includes some tools for dealing with external knowledge sources (e.g. other corpora or Wikipedia).

Though at some point these may also become pipes.

We made it really easy to implement your own pipes. Just inherit from the Pipe class, specify the class's input and output types, and implement the __call__ method (that's what's called for each pipe).

For example:

from broca.pipeline import Pipe

class MyPipe(Pipe):
input = Pipe.type.docs
output = Pipe.type.vecs

def __init__(self, some_param):
self.some_param = some_param

def __call__(self, docs):
# do something with docs to get vectors
vecs = make_vecs_func(docs, self.some_param)
return vecs


We hope that others will implement their own pipes and submit them as pull requests - it would be great if broca becomes a repository of sundry NLP methods which makes it super easy to quickly try a battery of techniques on a problem.

broca is available on GitHub and also via pip:

pip install broca


Fellowship Status Update

I've long been fascinated with the potential for technology to liberate people from things people would rather not be doing. This relationship between human and machine almost invariably manifests in the context of production processes - making this procedure a bit more efficient here, streamlining this process a bit there.

But what about social processes? A hallmark of the internet today is the utter ugliness that's possible of people; a seemingly inescapable blemish on the grand visions of the internet's potential for social transformation. And the internet's opening of the floodgates has had the expected effect of information everywhere, though perhaps in volumes greater than anyone anticipated.

Here we are, trying to engage little tyrants at considerable emotional expense. Here we are, futilely chipping away at the info-deluge we're suspended in. Here we are, both these things gradually chipping away at us. Things people would rather not be doing.

Prior to my fellowship, these kinds of inquiries had to be relegated to off-hours skunkworks. The fellowship has given me the rare privilege of autonomy, both financial and temporal, and the resources, especially of the human kind, with which I can actually explore these questions as my job.

With the Coral Project, I'm researching what makes digital communities tick, surveying the problems with which they grapple, and learning about how different groups are approaching them - from video games to journalism to social networks both in the mainstream and the fringes (you can read my notes here). Soon we'll be building software to address these issues.

For my own projects I'm working on automatic summarization of comments sections, a service that keeps up with news when you can't, a reputation system for new social network, and all the auxiliary tools these kinds of projects tend to spawn, laying the groundwork for work I hope to continue long after the fellowship. I've been toying with the idea of simulating social networks to provide testing grounds for new automated community management tools. The best part is that it's up to me whether or not I pursue it.

A huge part of the fellowship is learning, which is something I love but have had to carve out my own time for. Here, it's part of the package. I've had ample opportunity to really dig into the sprawling landscape of machine learning and AI (my in-progress notes are here), something I've long wanted to do but never had the space for.

The applications for the 2016 fellowship are open, and I encourage you to apply. Rub shoulders with fantastic and talented folks from a variety of backgrounds. Pursue the questions that conventional employment prohibits you from. Explore topics and skills you've never had time for. It's really what you make of it. At the very least, it's a unique opportunity to be deliberate about where your work takes you.

The halfway mark of my OpenNews fellowship has just about passed. I knew from the start the time would pass by quickly, but I hadn't considered how much could happen in this short a time. There are only about 5 months left - the fellowship does end, but, fortunately, the work it inaugurates doesn't have to.