  • 18th Dec, 2014
  • Francis
  • etc
Image source, modified and licensed under CC BY-SA 3.0.

The class I've been teaching this fall at the New School, News Automata, didn't require any previous programming experience, so I wanted to find a way to teach my students some basics. My plan was to get them set up and rolling with the fundamentals, then transition into a journalism-related program involving some text processing libraries.

But when it came time to teach the class, almost the entire two hours were spent setting up the students' development environments. We were working in a school computer lab so students didn't have access to the permissions they might need to install packages and what not. I figured out a decent workaround, but even then, if you aren't already familiar with this stuff, environment configuration is super tedious and frustrating. It's often that way even if you are already familiar with this stuff.

Later, we tried again: this time students brought in their own computers. But of course, everyone had different systems - OSX, Windows, ChromeOS... - and I realized that all the tools which make env setup easier (package managers and so on) require their own setup, which just complicates things even further. And the packages I wanted the students to use depended on some lower-level libraries, such as numpy and scipy, which required compiling - and that could take ages depending on the student's hardware. By the time most of the students had their environments set up, everyone was exhausted and dispirited. What a mess.

So to make this all a bit easier, I put together a system which allows students to share a remote development environment. All I had to do was set up the environment once on my own server, and then students could submit code to it through a web interface. As long as a student had a browser, they could write their scripts. It was a huge, huge time saver - everyone could dive right into playing with some code.

I cleaned up the project and turned it into a package to reuse: Pasture. Now you can install it like you would any pypi package and build something on top of it.

I've included an example from the class I was teaching. I had students try their hand at writing some simple newsfeed filtering algorithms. Building on top of Pasture, students could submit a script which included a filtering function and see a newsfeed filtered in real(-ish) time.
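For a sense of what students wrote, a submission might have looked something like this - the function name and article fields here are made up for illustration (see the repo for the actual interface Pasture expects):

```python
# A hypothetical newsfeed filter: keep only articles mentioning
# certain keywords, then sort the survivors by title length.
# (Function name and article fields are illustrative, not Pasture's API.)

KEYWORDS = {'election', 'climate', 'privacy'}

def filter_feed(articles):
    keep = []
    for article in articles:
        text = (article['title'] + ' ' + article['body']).lower()
        if any(kw in text for kw in KEYWORDS):
            keep.append(article)
    return sorted(keep, key=lambda a: len(a['title']))

feed = [
    {'title': 'Election results are in', 'body': 'A long night...'},
    {'title': 'Celebrity gossip roundup', 'body': 'You will not believe...'},
    {'title': 'New climate report', 'body': 'Scientists warn...'},
]
filtered = filter_feed(feed)
```

Even a toy function like this was enough for students to see their filtering choices change the feed immediately.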

Check out the GitHub repo for usage instructions and the full example.



  • 14th Dec, 2014
  • Francis
  • etc

I'm a big supporter of decentralized internet services, but I used to use Evernote because it was really convenient. That's always the problem, isn't it? Some things are just so convenient ¯\_(◔ ‿ ◔)_/¯. A few months ago I began using Linux a lot more, and when I discovered that there was neither a Linux Evernote client nor any plans to release one, it seemed like a good opportunity to make the transition to another service.

But I didn't really like any of the other available services, and since most of them were still centralized, I ended up making my own: Nomadic.

It doesn't pack nearly as many features as Evernote does, but that's ok because I think Evernote has too many features anyways. Nomadic can still do quite a lot:

  • GitHub-flavored Markdown
  • Some extra markdown features: highlighting, PDF embedding
  • Syntax highlighting
  • MathJax support
  • Automatically updates image references if they are moved
  • Full-text search (HTML, txt, Markdown, and even PDFs)
  • A rich-text editor - I use it for copying stuff from webpages, though it's not the best at that. It does, however, convert pasted web content into Markdown.
  • A browsable site for your notes (Markdown notes are presented as HTML)
  • A complete command line interface
  • Notes can be exported as web presentations

For someone familiar with setting up web services, Nomadic is pretty easy to install. I'll refer you to the readme for instructions. For anyone else, it's kind of tricky. It would be great to make it non-dev friendly, and I might work on this later if people start asking for it.

Basically, all you do is manage a directory of notes, in Markdown or plaintext. The Nomadic daemon, nomadic-d, indexes new and changed notes, updates resource references if notes move, and runs a lightweight web server for you to browse and search through your notes. That's it.

If you want remote access, you can host Nomadic on a server of your own and put some authentication on it. Then you can SSH in to edit notes and what not. Down the line this kind of support would ideally be built into Nomadic and simplified for less technical users.

Alternatively, you can do what I do - use BitTorrent Sync, which is an application built on top of the BitTorrent protocol for decentralized "cloud" storage. So I sync my Nomadic notes folder across all my devices and can edit them locally on each. On my Android phone I use JotterPad to view and edit notes.

There are a bunch of tweaks and improvements that can be made, but I've been using it for the past four months or so and have felt no need to return to Evernote :)


Argos: Clustering

  • 11th Dec, 2014
  • Francis
  • etc

In its current state, Argos is an orchestra of many parts:

  • argos - the core project
  • - the infrastructure deployment/configuration/management system
  • argos.corpora - a corpus builder for training and testing
  • argos.cluster, now galaxy - the document clustering package
  • argos.ios and argos.android - the mobile apps

It's an expansive project so there are a lot of random rabbit holes I could go down. But for now I'm just going to focus on the process of developing the clustering system. This is the system which groups articles into events and events into stories, allowing for automatically-generated story timelines. At this point it's probably where most of the development time has been spent.

Getting it wrong: a lesson in trying a lot

When I first started working on Argos, I didn't have much experience in natural language processing (NLP) - I still don't! But I have gained enough to work through some of Argos's main challenges. That has probably been one of the most rewarding parts of the process - at the start some of the NLP papers I read were incomprehensible; now I have a decent grasp on their concepts and how to implement them.

The initial clustering approach was hierarchical agglomerative clustering (HAC) - "agglomerative" because each item starts in its own cluster and clusters are merged sequentially by similarity (the two most similar clusters are merged, then the next two most similar, and so on), and "hierarchical" because the end result is a hierarchy as opposed to explicit clusters.

Intuitively it seemed like a good approach - HAC is agnostic to how similarity is calculated, which left a lot of flexibility in deciding what metric to use (euclidean, cosine, etc) and what features to use (bag-of-words, extracted entities, a combination of the two, etc). The construction of a hierarchy meant that clustering articles into events and clustering events into stories could be accomplished simultaneously - all articles would just be clustered once, and the hierarchy would be snipped at two different levels: once to generate the event, and again at a higher level to generate the stories.
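That cluster-once, snip-twice idea can be sketched with scipy's hierarchical clustering tools - toy 2D points stand in for article feature vectors here, and the thresholds are made up:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy 2D points standing in for article feature vectors.
points = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],   # one tight group
    [5.0, 5.0], [5.1, 5.0],               # another tight group
    [50.0, 50.0],                         # an outlier
])

# Build the hierarchy once...
Z = linkage(points, method='average', metric='euclidean')

# ...then snip at two levels: a tight threshold for "events",
# a looser one for "stories".
events  = fcluster(Z, t=0.5,  criterion='distance')
stories = fcluster(Z, t=10.0, criterion='distance')
```

The hierarchy `Z` is built once; each `fcluster` call is just a different cut of the same tree.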

Except I completely botched the HAC implementation and didn't realize it for waaay too long. The cluster results sucked and I just thought the approach was inappropriate for this domain. To top it off, I hadn't realized that I could just cluster once, snip twice (as explained above), and I was separately clustering articles into events and events into stories. This slowed things down a ton, but it was already super slow and memory-intensive to begin with.

Meanwhile I focused on developing some of the other functionality, and there was plenty of that to do. I postponed working on the clustering algorithm and told myself I'd just hire an NLP expert to consult on a good approach (i.e. I may never get around to it).

A few months later I finally got around to revisiting the clustering module. I re-read the paper describing HAC and then it became stunningly obvious that my implementation was way off base. I had some time off with my brother and together we wrote a much faster and much simpler implementation in less than an hour.

But even with that small triumph, I realized that HAC had, in this form, a fatal flaw. It generates the hierarchy in one pass and has no way of updating that hierarchy. If a new article came along, I had no choice but to reconstruct the hierarchy from scratch. Imagine if you had a brick building and the only way you could add another brick was by blowing the whole thing up and relaying each brick again. Clustering would become intolerably slow.

I spent a while researching incremental or online clustering approaches - those which were well-suited to incorporating new data as it became available. In retrospect I should have immediately begun researching this kind of algorithm, but 6 months prior I didn't know enough to consider it.

After some time I had collected a few approaches which seemed promising - including one which is HAC adapted for an incremental setting (IHAC). I ended up hiring a contractor (I'll call him Sol) who had been studying NLP algorithms to help with their implementation (I didn't want to risk another botch-implementation). Sol was fantastic and together we were able to try out most of the approaches.

IHAC was the most promising and is the one I ended up going with. It's basically HAC with a modifiable hierarchy. The hierarchy can take a new piece of data and minimally restructure itself to incorporate it.
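As a heavily simplified sketch of that intuition - a new leaf gets spliced in next to its nearest neighbor rather than rebuilding the whole tree (real IHAC also restructures ancestor nodes based on similarity criteria, which this toy version skips):

```python
class Node:
    """A node in a toy cluster hierarchy (leaves hold points)."""
    def __init__(self, point=None):
        self.point = point       # set on leaves only
        self.children = []
        self.parent = None

    def add(self, child):
        child.parent = self
        self.children.append(child)

def leaves(node):
    if node.point is not None:
        return [node]
    return [l for c in node.children for l in leaves(c)]

def insert(root, value):
    # Find the nearest existing leaf...
    nearest = min(leaves(root), key=lambda l: abs(l.point - value))
    # ...and replace it with an internal node holding both the old
    # leaf and the new one - a minimal, local restructuring.
    parent = nearest.parent
    merged = Node()
    merged.add(nearest)
    merged.add(Node(point=value))
    parent.children.remove(nearest)
    parent.add(merged)

# A tiny initial hierarchy: ((1.0, 2.0), 10.0)
root, pair = Node(), Node()
pair.add(Node(point=1.0))
pair.add(Node(point=2.0))
root.add(pair)
root.add(Node(point=10.0))

insert(root, 1.5)  # splices in near the 1.0/2.0 cluster
```

The point is that only one branch of the tree changes - no brick building gets blown up.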

I rewrote Sol's implementation (mainly to familiarize myself with it) and started evaluating it on test data, trying to narrow down a set of parameters well-suited to news articles. It was pretty slow, so I tried to parallelize it, but even a second process was enough to run into memory issues. After some profiling and rewriting of key memory bottlenecks, memory usage was reduced by 75-95%! Now certain parts could be parallelized, but it was still quite slow, mainly because it was built using higher-level Python objects and methods.

I ended up rewriting the implementation again, this time moving as much as I could to numpy and scipy, very fast scientific computing Python libraries where a lot of the heavy lifting is done in C. Again, I saw huge improvements - the clustering went something like 12 to 20 times faster!
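To illustrate the kind of win involved (this isn't Argos's actual code): computing all pairwise squared distances with Python loops versus a single vectorized numpy expression that does the heavy lifting in C.

```python
import numpy as np

def pairwise_sq_dists_python(points):
    # Pure-Python triple loop: every arithmetic op is interpreted.
    n = len(points)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            out[i][j] = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
    return out

def pairwise_sq_dists_numpy(X):
    # (x - y)^2 = x.x + y.y - 2*x.y, computed in bulk in C.
    sq = (X ** 2).sum(axis=1)
    return sq[:, None] + sq[None, :] - 2 * X @ X.T

X = np.random.rand(100, 20)
slow = pairwise_sq_dists_python(X.tolist())
fast = pairwise_sq_dists_numpy(X)
```

Both give the same answer; the vectorized version just avoids the per-element interpreter overhead, which is where speedups of the magnitude mentioned above come from.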

Of course, there were still some speedbumps along the way - bugs here and there, which in the numpy implementation were a bit harder to fix. But now I have a solid implementation which is fast, memory-efficient, persistent (using pytables), and takes full advantage of the algorithm's hierarchical properties (getting events and stories in just two snips).

For the past few days Argos has been in a trial period, humming on a server collecting and clustering articles, and so far it has been doing surprisingly well. The difference between the original implementation and this new one is night and day.

At first Argos was only running on world and politics news, but today I added in some science, tech, and business news sources to see how it will handle those.

It was a long and exhausting journey, but more than anything I'm happy to see that the clustering is working well and quickly!



  • 9th Dec, 2014
  • Francis
  • etc
Some of Argos' onboarding screens.

Almost four years ago I got it in my head that it would be fun to try to build a way to automatically detect breaking news. I thought of creating a predictive algorithm based on deviations of word usage across the internet - if usage of the word "pizza" suddenly deviated from its historical mean, something was going on with pizza! Looking back, it was a really silly idea, but it got me interested in working programmatically with natural language and eventually (somehow) morphed into Argos, the news automation service I started working on a little over a year ago. Since I recently got Argos to a fairly well-functioning tech demo stage, this seems like a good spot to reflect on what's been done so far.

The fundamental technical goal for Argos is to automatically apply structure to a chaotic news environment, to make news easier for computers to process and for people to understand.

When news pundits talk about the changing nature of news in the digital age, they often try to pinpoint the "atomic unit" of news. Stretching this analogy a bit far, Argos tries to break news into its "subatomic particles" and let others assemble them into whatever kind of atom they want. Argos can function as a service providing smaller pieces of news to readers, but also as a platform that other developers can build on.

The long-term vision for Argos is to contribute to what I believe is journalism's best function - to provide a simulacrum of the world beyond our individual experience. There are a lot of things standing in the way of that goal. This initial version of Argos focuses on the two biggest obstacles: information overload and complex stories that span long time periods.

At this point in development, Argos watches news sources for new articles and automatically groups them into events. It's then able to take these events and build stories out of them, presented as timelines. As an example, the grand jury announcing their verdict for Darren Wilson would be one event. Another event would be Darren Wilson's resignation, and another would be the protests which followed the grand jury verdict in Ferguson and across the country. Multiple publications reported on each of these events. A lot of that reporting might be redundant, so by collapsing these articles into one unit, Argos eliminates some noise and redundancy.

Argos picked these events out of a few weeks worth of news stories and automatically compiled an event summary. These screenshots are from the Argos Android test app.

These events would all be grouped into the same story. The ongoing protests around Eric Garner's murder would also be an event but would not necessarily be part of the same story, even though the two are related thematically.

A five-point summary is generated for each event, cited from that event's source articles. Thus the timeline for a story functions as an automatically generated brief on everything that's happened up until the latest event. The main use case here is long-burning stories like Ferguson or the Ukraine conflict which often are difficult to follow if you haven't been following the story from the start.

Argos can also see what people, places, organizations, and other key terms are being discussed in an event, and instantaneously provide information about these terms to quickly inform or remind readers.

Argos detects concepts discussed in an event and can supplement some information, sourced from Wikipedia.

Finally, Argos calculates a "social (media) importance" score for each event to try and estimate what topics are trending. This is mainly to support a "day-in-brief" function (or "week-in-brief", etc), i.e. the top n most "important" (assuming talked about == important) events of today. Later it would be great to integrate discussions and other social signals happening around an event.

I've been testing Argos mainly with world and political news (under the assumption that those would be easier to work with for technical reasons). So far that has been working well, so I recently started trying some different news domains, though it's too early to say how that's working out.

The API is not yet public and I'm not sure when it will be officially released. At the moment I can't devote a whole lot of time to the project (if you're interested in becoming involved, get in touch). Argos does have an unreleased Android app (and an older version for iOS) which at this point is mainly just a tech demo for small-scale testing. Frankly, I don't know if Argos will work best as a consumer product or as an intermediary technology powering some other consumer service.

(Later I'll write a post detailing the development of Argos up until now.)


Unity and PureData: Procedural Soundtracks

  • 7th Dec, 2014
  • Francis
  • code

For The Founder, Johann and I decided to use a system of generative music. The approach we're trying is pairing PureData with Unity.

Kalimba is a library for Unity which gives you basic access to PureData (via libpd). The project is a couple years old but still seems to hold up, though it was tricky to set up. It comes with an example, but that could use updating. These instructions should be a tad easier to follow.

Unity setup

First, clone the Kalimba repo:

git clone ~/kalimba

For these steps I'm going to set some variables to make things simpler:
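Something like the following, based on how the variables are used below - the exact paths are assumptions, so adjust them to your own machine and project layout:

```shell
# Where you cloned Kalimba, and where its Unity assets live
# (the exact path inside the repo may differ).
KALIMBA_DIR=~/kalimba
K_ASSETS_DIR=$KALIMBA_DIR/unity-exampleapp/Assets

# Your own Unity project's Assets folder.
ASSETS_DIR=~/projects/my-game/Assets
```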


Copy over the Kalimba Unity plugin:

cp -r $K_ASSETS_DIR/Plugins/Kalimba $ASSETS_DIR/Plugins/

Finally, copy over the StreamingAssets folder, where you'll keep your PureData files:

cp -r $K_ASSETS_DIR/StreamingAssets $ASSETS_DIR/

To try things out you can copy over the demo script and attach it to a GameObject:

cp $K_ASSETS_DIR/Scripts/AudioTest.cs $ASSETS_DIR/Scripts/

Then you can run your game scene in the editor and you should be presented with some buttons to test the functionality.


Android setup

Set up a directory to work in:

mkdir -p app/lib
mkdir app/src

Install ant to compile our custom activity's jar:

brew install ant

Copy over the other classes from Kalimba:

cp -r $KALIMBA_DIR/android-exampleapp/{org,com} app/src/

I also remove the example:

rm -rf app/src/com/bitbarons/exampleapp

Then symlink the Android and Unity libraries (change these paths to match your own):

cd app/lib
ln -s ~/Downloads/android-sdk-macosx/platforms/android-21/android.jar
ln -s /Applications/Unity/

Now create your custom activity. For me, this was:

cd ..
mkdir -p src/co/publicscience/thefounder
vi src/co/publicscience/thefounder/

And you can keep its contents simple:

package co.publicscience.thefounder;

import com.bitbarons.kalimba.KalimbaActivity;

public class MainActivity extends KalimbaActivity {
}

Next, copy the default Unity AndroidManifest.xml and update it to point to your custom activity. E.g. replace com.unity3d.player.UnityPlayerNativeActivity with .MainActivity.
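The relevant edit is the activity declaration, something along these lines (attributes abbreviated; your generated manifest may differ):

```xml
<!-- Before: the stock Unity activity -->
<activity android:name="com.unity3d.player.UnityPlayerNativeActivity" android:label="@string/app_name">
    ...
</activity>

<!-- After: your Kalimba-backed activity -->
<activity android:name=".MainActivity" android:label="@string/app_name">
    ...
</activity>
```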

cp /Applications/Unity/ AndroidManifest.xml
vi AndroidManifest.xml

Then you need a build.xml to instruct ant on how to process everything. This one should do nicely:

<?xml version="1.0" encoding="UTF-8"?>
<project default="jar">

    <path id="classpath">
        <fileset dir="libs" includes="**/*.jar"/>
    </path>

    <target name="clean">
        <delete dir="bin/"/>
    </target>

    <target name="compile">
        <mkdir dir="bin"/>
        <javac srcdir="src" destdir="bin">
            <classpath refid="classpath"/>
        </javac>
    </target>

    <target name="jar">
        <jar destfile="customActivity.jar" basedir="bin"/>
    </target>
</project>

Then we can compile and build our jars:

ant clean
ant compile
ant jar

In Unity, make absolutely sure that the bundle identifier for your build matches the one for your main activity, e.g. co.publicscience.thefounder. You can check with File > Build Settings, then select "Android" and hit "Player Settings". Then in "Other Settings", look for "Identification" and check the "Bundle Identifier". Your "Minimum API Level" here should also match the one specified in your AndroidManifest.xml.

Next, copy the Android plugin files over from the Kalimba repo into your Unity Assets folder:

mkdir -p $ASSETS_DIR/Plugins/Android
cp $K_ASSETS_DIR/Plugins/Android/ $ASSETS_DIR/Plugins/Android/

Then copy your custom activity jar and manifest over:

cp {customActivity.jar,AndroidManifest.xml} $ASSETS_DIR/Plugins/Android/

It's also worth noting that there's a ~2s audio delay on Android. For our purposes it's tolerable.

Before building to Android, take note of the Caveats section below.

iOS setup
For iOS, we use a Python script to handle a lot of this configuration automatically. It takes a bit of setup:

We created a folder called Build in our Unity project root to put all our iOS build stuff:

mkdir Build

Then you can setup the script:

vi Build/

with the contents:

import re
import os
import shutil

# Change these for your needs.
BUILD_PATH  = 'ios_build'
LIBPD_PATH  = 'ios_libpd'
PD_FILE     = 'kalimbaTest.pd'
ASSETS_PATH = '../Assets'

if __name__ == '__main__':

    print('Modifying the code...')

    # filepath: [(line number, code),...]
    modifications = {

        'Classes/': [
            (67, '@synthesize audioController = audioController_;\n\n'),
            (89, '''
        self.audioController = [[[PdAudioController alloc] init] autorelease];
        [self.audioController configurePlaybackWithSampleRate:44100 numberChannels:2 inputEnabled:YES mixingEnabled:NO];

        [self.audioController configureTicksPerBuffer:128];

        [PdBase openFile:@"{pd_file}" path:[[NSBundle mainBundle] resourcePath]];
        [self.audioController setActive:YES];
        [self.audioController print];\n'''.format(pd_file=PD_FILE))
        ],

        'Classes/UnityAppController.h': [
            (6, '#import "PdAudioController.h"\n#import "PdBase.h"\n'),
            (27, '\n@property (nonatomic, retain) PdAudioController *audioController;\n')
        ]
    }

    for file, injections in modifications.items():
        filepath = os.path.join(BUILD_PATH, file)
        with open(filepath, 'r') as f:
            contents = f.readlines()

        for index, code in injections:
            contents.insert(index, code)

        with open(filepath, 'w') as f:
            f.writelines(contents)

    print('Copying over files...')
    patches_path = os.path.join(BUILD_PATH, 'Patches')
    shutil.copy(os.path.join(ASSETS_PATH, 'StreamingAssets/pd', PD_FILE), patches_path)
    shutil.copytree(LIBPD_PATH, os.path.join(BUILD_PATH, 'ios_libpd'))

    print('Copying the prepared pbxproj file...')
    proj_path = os.path.join(BUILD_PATH, 'Unity-iPhone.xcodeproj/project.pbxproj')
    shutil.copy('project.pbxproj', proj_path)

    print('All finished!')

The script reuses an existing project.pbxproj file in later builds, so we don't have to reconfigure and re-add everything by hand. Another part of the script copies over files and injects some additional setup code for Kalimba.

We have to setup the iOS Xcode project manually once, and then save its resulting project.pbxproj file.

In Unity, build your iOS project (not "Build & Run", just "Build") and save it to Build/ios_build:

Then copy over the ios_libpd directory from Kalimba (note that the name has changed from ios-libpd to ios_libpd):

cp -r $KALIMBA_DIR/ios-libpd Build/ios_libpd

This is so the script knows where to find it.

Then copy it again to the iOS build folder:

cp -r Build/ios_libpd Build/ios_build/

And then make a directory for your PD patch and copy it over:

mkdir Build/ios_build/Patches
cp Assets/StreamingAssets/pd/kalimbaTest.pd Build/ios_build/Patches/

Then open the resulting Xcode project:

open Build/ios_build/Unity-iPhone.xcodeproj

and add the Patches and ios_libpd folders to the project. Make sure you have the check box in "Add to Targets" checked next to the "Unity-iPhone" entry.

Then go to the project's "Build Settings", find the "Other C Flags" build setting, and set it to be:


for both Debug and Release configurations.

That's all for Xcode - now quit and then copy the project's project.pbxproj to the Build folder:

cp Build/ios_build/Unity-iPhone.xcodeproj/project.pbxproj Build/project.pbxproj

The script will know to find it here and copy it over.

Try running the script and it should apply the proper changes:

cd Build
> Modifying the code...
> Copying over files...
> Copying the prepared pbxproj file...
> All finished!

Open the Xcode project again and you should see the changes applied to UnityAppController.h and the other modified files. See the script itself for more info on what changes were made.

Now every time you build for iOS, don't use "Build & Run". Instead, use "Build", save it to Build/ios_build, then:

cd Build

And you can then open up the Xcode project and build to your device from there!

Caveats
If you try to build for Android or iOS, you may run into the error that System.Net.Sockets requires Unity Pro. See this issue. You can just comment out line 21 in Assets/Plugins/Kalimba/KalimbaPd.cs:

impl = new KalimbaPdImplNetwork();

and remove the file which imports System.Net.Sockets (Assets/Plugins/Kalimba/KalimbaPdImplNetwork.cs). Unfortunately, with these parts removed you will not be able to hear sounds produced by Kalimba/PD when playing a scene in the editor.

For more info on setting up Unity/PD/Kalimba, check out this post and this post.


News Automata

  • 6th Dec, 2014
  • Francis
  • etc

Since mid-September I've been teaching at The New School as part of their new Journalism+Design program. Heather Chaplin, who organizes the program, brought me in to teach a new class I put together called News Automata, which explores the technical underpinnings, ethical concerns, and potential benefits/pitfalls of technology in journalism.

The way I ended up teaching the class deviated a bit from my original plan. For reference, here's the initial syllabus:

News Automata Fall 2014 syllabus

Week 1: Introduction to the Course

Introduction to journalism and how technology is affecting it both as an industry and as a practice. What is the point of journalism? We'll talk about a few of the major definitions of journalism's responsibility.

We'll also talk about the format of the course and what'll be expected.

Some things we'll go over:

  • Bias in journalism? Is "fair and balanced" actually fair?
  • Does journalism work?
  • Varying definitions of journalism: the spotlight, the watchdog, the entertainer, the simulacrum, the influencer, the activator, the business

Week 2-3: Leveraging networks: from consumers => producers

How are individuals becoming empowered to participate in the production of news, whether through whistleblowing or "citizen journalism"? How does news "break" in the age of Twitter? How are journalists collaborating with on-the-ground sources to have direct access to those affected, in real time? How are online communities becoming involved in the investigative process? We'll discuss some of the challenges in this area, such as verifying accuracy and security, and look at how things can go wrong.

Week 4: Bots and drones: the automated assemblage

Automation is creeping into more and more parts of our lives. Jobs which existed a few decades ago are now obsolete. Jobs which exist now will likely be obsolete in an even shorter amount of time. Will journalists' jobs be among them?

Week 5: Information overload and context

The information age is a blessing and a curse. What good is all that information if you don't have the time, energy, or attention to make use of it? What approaches are being used to make the news easier to digest and give us the fullest understanding of what's happening in the world? Technology and design are both about getting more with less. How can we get the maximum impact from the smallest amount of information?

Week 6: Engineering virality and control over networks

Facebook wants you to be happy, BuzzFeed knows you like lists and quizzes, and Upworthy understands how to tease you into clicking a link. They have all been immensely successful. Is that all these companies are about? Or is there something larger at play? What happens when all news is designed just to get us to click into it? Or has that already happened?

Week 7: The Filter Bubble and the new gatekeepers

The most prominent approach to managing information overload is self, algorithmic, or thought-leader curation. But the nature of these three filtering mechanisms leads many to worry – are we just seeing more of the same? How does that affect how we think about the world? Is that fundamentally antithetical to journalism's aspirations as a practice?

Week 8: Taking action

Journalism is about more than knowing what's happening. It's also about acting on that knowledge. How can we design systems to close that intent-action gap? How can we make it easier for people to organize and act on the issues they feel strongly about?

Here's what the classes ended up being about:

  • What does journalism hope to accomplish?
  • Social networks and news production: crowdsourcing/citizen journalism, problems of verification, perspectives, popular curation vs the gatekeeper model
  • Automation, news bots, intro to machine learning, intro to natural language processing
  • Communities and discussions online, anonymity, and bits of behavioral economics, game theory, decision theory, group dynamics, sociological/psychological research for understanding online behavior
  • How elections are reported (the week after midterm elections), how algorithmic filters work, filter bubbles/"personalized propaganda"
  • Hands-on: building our own news feeds and algorithmic filters with Python
  • The maturing medium of video games (narrative, mechanics, aesthetic, and technical perspectives) and how it relates to journalism

There are still two more classes in which I plan on covering:

  • The physical basis of the internet (an overview on its infrastructure and the politics of that infrastructure)
  • Taking action and digital impact IRL: slacktivism, hacktivism (Anonymous, Wikileaks), doxxing, DDoS/LOIC, etc

This was my first time teaching a class, so it's been a great learning experience for me, and my students were great. I'm hoping to teach it again next fall. The class was public, which ended up being a boon (I was worried about it at first) - lots of people from all sorts of backgrounds stopped in for a class or two and had interesting contributions to our discussions.

I'll put up some of the lectures and materials soon.