Pablo

The Avalanches' _Since I Left You_

My favorite album of all time is The Avalanches' Since I Left You. It's a masterpiece of "plunderphonics", which essentially means "sound collage": constructing songs purely out of broadly sourced samples. In the 15 years (!!) since the album came out, a second release has been promised but has yet to materialize.

In this endless interim I wrote Pablo (named after "Pablo's Cruise", which is a song on Since I Left You and was the album's original title), a program for automatically generating plunderphonic music. Its output is certainly nothing of The Avalanches' caliber, but it's meant more for rapidly sketching ideas than for generating fully fleshed-out tracks.

The GitHub repo for Pablo has instructions for installation. Here I'll talk a bit more about how it works.

Selecting and preprocessing the songs

Song selection

The approach is pretty simple - you direct Pablo to a directory of music (mp3s or wavs) and it will start analyzing the tracks, approximating their BPMs and keys for later reference. Almost all of Pablo's audio analysis capabilities rely directly on the Essentia library.

Pablo then randomly picks a "focal" song - the song the rest of the mix is based on. Pablo goes through the other songs available in the directory and randomly chooses a few that are within a reasonable distance from the focal song in both key and BPM (to avoid ridiculous warping, although maybe you want that to happen).

These other songs are pitch-shifted and time-stretched to conform to the focal song. Then beat onset detection is run on each track to identify downbeats. These downbeats are used to slice the tracks into samples of different sizes (the sizes must be a power of 2, e.g. 4, 8, 16, 32, etc). By default these samples are 16 and 32 beats long.

A challenge here was beatmatching. For the most part, Pablo does pretty well at identifying proper beat onsets. But some samples become irregular - if there is an instrumental bridge, for instance, or a long intro, then Pablo may not cut that part into properly-sized samples. So there is an additional step where Pablo finds the mode sample duration (in milliseconds) and discards all samples not of this length. That way we have greater assurance that when the samples are finally assembled into tracks, they will align properly (since all samples of the same beat length will have the exact same time duration).
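
To make that concrete, here's a minimal sketch of the slicing and mode-duration filtering. It assumes beat onsets (in seconds) have already been detected and that pydub is used for slicing; it's illustrative, not Pablo's actual code:

```python
from statistics import mode
from pydub import AudioSegment

def slice_samples(path, onsets, beats_per_sample=16):
    """Cut a track into samples spanning `beats_per_sample` beats each."""
    audio = AudioSegment.from_file(path)
    samples = []
    for i in range(0, len(onsets) - beats_per_sample, beats_per_sample):
        start_ms = int(onsets[i] * 1000)
        end_ms = int(onsets[i + beats_per_sample] * 1000)
        samples.append(audio[start_ms:end_ms])
    return samples

def keep_regular_samples(samples):
    """Discard samples whose duration differs from the most common duration,
    so that equal-length samples line up when assembled."""
    durations = [len(s) for s in samples]  # pydub reports length in ms
    typical = mode(durations)
    return [s for s in samples if len(s) == typical]
```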

Assembling the mix

With all the samples preprocessed, Pablo can begin putting the tracks together. To do so, Pablo just places samples one after another using the stochastic process outlined below.

A song length is specified in beats, which must also be a power of 2. For instance, we may want a song that is 64 beats long. Pablo then recursively goes down by powers of 2 and tries to "fill in" these beats.

For example, say we want a 64-beat song and we have samples of sizes 32, 16, and 8. Pablo first checks to see if a 64-beat sample is available. One isn't, so it knows it needs two 32-beat samples (i.e. we have two 32-beat slots to fill). A 32-beat sample is available, so for each 32-beat slot Pablo randomly decides whether to use a complete 32-beat sample or to further split that slot into two 16-beat slots. If Pablo chooses the latter, the same process is applied again until the smallest sample size is reached. So in this example, Pablo would stop at 8-beat slots, since it has no samples smaller than 8 beats.
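
Here's a rough sketch of that recursive slot-filling logic; the function names and the 50/50 split probability are my own illustration, not Pablo's actual implementation:

```python
import random

def fill_slot(n_beats, samples_by_size):
    """Fill an n-beat slot using samples whose beat-lengths are powers of 2."""
    smallest = min(samples_by_size)
    # Use one sample for the whole slot if we have one and a coin flip says so,
    # or if we can't split any further.
    if n_beats == smallest or (n_beats in samples_by_size and random.random() < 0.5):
        return [random.choice(samples_by_size[n_beats])]
    # Otherwise split the slot into two half-sized slots and fill each recursively.
    half = n_beats // 2
    return fill_slot(half, samples_by_size) + fill_slot(half, samples_by_size)

# e.g. a 64-beat track built from 32-, 16-, and 8-beat samples:
# track = fill_slot(64, {32: [...], 16: [...], 8: [...]})
```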

Selecting samples

Pablo also has some heuristics to ensure a degree of coherence in the tracks it creates. Without this, Pablo's tracks would be too spastic, with sudden cutoffs and changes in samples every bar. So to decide what sample should follow the current sample, Pablo uses a simple Markov chain model which favors staying in the same song and favors repeating the same sample.

Markov Chain Coherence
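
As a toy illustration of this idea - the weights and the `.song` attribute here are made up, not Pablo's actual transition probabilities:

```python
import random

def next_sample(current, samples, p_repeat=0.4, p_same_song=0.4):
    """Pick the next sample, favoring repetition and staying in the same song."""
    roll = random.random()
    if roll < p_repeat:
        return current  # repeat the current sample
    same_song = [s for s in samples if s.song == current.song and s is not current]
    if roll < p_repeat + p_same_song and same_song:
        return random.choice(same_song)  # stay within the same song
    other = [s for s in samples if s.song != current.song]
    return random.choice(other) if other else current  # jump to another song
```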

Each track is constructed in this way. Pablo creates (by default) two tracks which are then overlaid to form the final mix. However, sometimes samples from the same song are overlaid on top of each other, which isn't quite what I wanted, so one final heuristic is implemented. When constructing tracks after the first, Pablo will try its best to avoid placing samples from the same song over each other. Sometimes this is unavoidable, such as when there are more tracks being made than there are songs. But for the most part it works out.

That's really about it in terms of generating the songs. I've also been playing a bit with vocal detection, since hearing multiple people singing can be disorienting, and I would like to add in some genre-identification features as well.

The Infinite Playlist

One final feature is the "crate digging" feature, in which Pablo takes a seed YouTube url and crawls its related videos, quickly amassing some songs to sample. Eventually I'd like it so that Pablo can become the infinite playlist, crawling YouTube (and perhaps SoundCloud and other sources) and endlessly mixing songs into each other.

An example song

Here's a song that Pablo produced:

It's a bit hectic, but for the most part the beats match well and there are some interesting moments in there. Pablo outputs sample listings with times, along with the original samples, so it's easy to recreate the parts that you like.

Source

You can check out how all this is implemented in the GitHub repo, along with installation instructions to run Pablo yourself!


Port

I've had a lot of trouble finding a blogging platform which I enjoyed using - they were either too heavy, required a web editor, or were tricky to customize and theme. So last week I put together a super lightweight blogging platform called port.

port is the centralized complement to nomadic, which is a lightweight note management system. Like nomadic, port uses directory structure to define its routing structure and works primarily off markdown documents.

Here's what port offers:

  • Posts are GitHub-flavored markdown files, with MathJax support
  • A blog is represented as a directory, with subdirectories of markdown posts
  • Post categories are defined by the folders they are in
  • RSS feeds for individual categories and for all posts
  • Very easy theming (Jinja2 templates)
  • Managed via a CLI (the port command)
  • No database necessary!
  • No admin UI necessary!

There are a couple additional convenience features as well:

  • port sync makes it easy to sync your local site files to a remote machine (e.g. your server)
  • port host generates the config files necessary to host your blog with uwsgi, supervisor, and nginx
  • port serve serves the site locally, which is useful for previewing your work

Tutorial

Here's how to get going with port. For this example I'll specify my site's directory as ~/spaceandtimes.

Install

pip install port

Create a site (I'll use my site as an example) and go through the config

port create spaceandtimes

Set up a category (just renaming the default category port creates)

mv ~/spaceandtimes/default_category ~/spaceandtimes/coding

Add an image to the assets folder

mv ~/some_image.png ~/spaceandtimes/assets/

Create a post (using vi here but use whatever editor you'd like)

vi ~/spaceandtimes/coding/my_first_post.md

---
published_at: 06.30.2015 10:53
---

Hello!

Here's an image:
![some alt text](/assets/some_image.png)

Build the site

port build spaceandtimes

Serve the site locally

port serve spaceandtimes

Check it out at localhost:5005!

To host the site on a server, follow these steps (this assumes an Ubuntu server):

Install the right stuff

sudo apt-get update
sudo apt-get install uwsgi supervisor uwsgi-plugin-python uwsgi-plugin-python3 nginx

Install port on the server

pip install port

Create the site on the server (again, say we put the site at ~/spaceandtimes):

port create spaceandtimes

Then run the host command to automatically generate the necessary configs and restart the nginx and supervisor processes (the last argument is the user that the site was generated under):

port host spaceandtimes blog.spaceandtimes.com ftseng

From your local machine, sync your site (but first, build your site locally):

port sync spaceandtimes user@my_server.com:~/spaceandtimes

Workflow

My workflow with port is to write posts locally, build locally, and then sync to my server. It's super fast and easy :)

So basically:

vi ~/spaceandtimes/etc/my_new_post.md
port build spaceandtimes
port sync spaceandtimes ftseng@myserver.com:~/spaceandtimes

And that's it!

See the GitHub repo for more info on theming and other details.


Clustering news articles with hs_cluster

This week I returned to tackling the news clustering problem I had been exploring last year.

The general goal is to take in groups of news articles and cluster them so that articles talking about the same event are grouped together.

Before, I used a clustering technique that was an incremental form of hierarchical agglomerative clustering (IHAC, implementation available here) which seemed better suited for the task compared to more popular approaches like k-means and DBSCAN. However, that technique had its own problems.

The problem

The clustering quality still was not where I needed it to be - in particular, many of the resulting article clusters (which I call "events") included articles which were unrelated to the others.

This is a problem of cluster homogeneity, which asks: how many members are in clusters with members from other true clusters? The complement of this is completeness, which asks: how many members that belong together (in the true clustering) were clustered together? Lower completeness is tolerable - people are less likely to notice the absence of an article that belongs than the presence of an article that doesn't (which is what homogeneity measures).

As a simple illustration, imagine the true clusters:

$$ [A_1, A_2, A_3], [B_1, B_2, B_3] $$

The following clustering has a homogeneity score of 1 (the best score) and a completeness score of $\approx 0.685$.

$$ [A_1, A_2], [A_3], [B_1, B_2, B_3] $$

The following clustering has a completeness score of 1 (the best score) and a homogeneity score of $\approx 0$.

$$ [A_1, A_2, A_3, B_1, B_2, B_3] $$

For my task, the former is preferable to the latter.

Note that the following clustering has a homogeneity score of 1, but a completeness score of $\approx 0.387$.

$$ [A_1], [A_2], [A_3], [B_1], [B_2], [B_3] $$

We don't want to achieve good homogeneity at the total expense of completeness, as is the case here!
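
These are the standard information-theoretic homogeneity and completeness measures; the scores above can be reproduced with scikit-learn, for example:

```python
from sklearn.metrics import homogeneity_score, completeness_score

truth = [0, 0, 0, 1, 1, 1]        # [A1, A2, A3], [B1, B2, B3]

split_a    = [0, 0, 1, 2, 2, 2]   # [A1, A2], [A3], [B1, B2, B3]
merged     = [0, 0, 0, 0, 0, 0]   # everything in one cluster
singletons = [0, 1, 2, 3, 4, 5]   # every article on its own

print(homogeneity_score(truth, split_a), completeness_score(truth, split_a))
# -> 1.0, ~0.685
print(homogeneity_score(truth, merged), completeness_score(truth, merged))
# -> 0.0, 1.0
print(homogeneity_score(truth, singletons), completeness_score(truth, singletons))
# -> 1.0, ~0.387
```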

Parameter selection

The severity of this problem was mostly affected by where I decided to snip the cluster hierarchy that IHAC produced. The problem is that this value is difficult to determine in advance as it may vary for different sets of articles.

This dilemma - known as parameter selection - is one that crops up a lot in clustering. For k-means, it's the question of the value for $k$; for DBSCAN, it's the value of $\epsilon$.

For IHAC my selection heuristic was based on the average density of branch nodes, but that was not working well.

In the case of k-means and DBSCAN, this problem of parameter selection is an area of active research, so there are a few promising techniques to start with.

DBSCAN parameter selection

For DBSCAN, one technique (outlined here) involves looking at the nearest neighbor for each point and identifying some majority value for $\epsilon$ based on that.

The technique I'm using is similar - it involves taking the $n$ nearest neighbors for each point, looking at the distances between them, and identifying the largest "jump" in distance. The average of the distances just before those jumps is used as $\epsilon$.

(Though I also wonder if $\epsilon$ is a parameter that could be learned; that is, if news events tend to have a similar article density.)
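
Here's a sketch of that heuristic (the details are my own simplification), using scikit-learn's nearest-neighbor search:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def estimate_eps(X, n_neighbors=5):
    """Estimate DBSCAN's epsilon from each point's nearest-neighbor distances."""
    nn = NearestNeighbors(n_neighbors=n_neighbors + 1).fit(X)
    dists, _ = nn.kneighbors(X)        # sorted; first column is the point itself (0)
    dists = dists[:, 1:]               # distances to the n nearest neighbors
    jumps = np.diff(dists, axis=1)     # gaps between consecutive neighbor distances
    idx = np.argmax(jumps, axis=1)     # index of the largest jump, per point
    pre_jump = dists[np.arange(len(dists)), idx]  # distance just before that jump
    return pre_jump.mean()
```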

k-means parameter selection

For k-means, one effective approach is outlined here, called "$f(K)$", which runs k-means iteratively with different values of $k$ and computes scores for them (see the article for details).

You do need to specify an upper limit for $k$, which in theory could just be $n$ (the number of documents), but practically speaking, this can be very slow and it's unlikely that every document will belong to its own cluster. I've been using $n//3 + 1$ under the assumption that the events have at least 3 news articles each.

For k-means, getting $k$ right is especially important. DBSCAN has a concept of outliers so points which don't seem to belong to clusters will be treated as noise. But for k-means, every point must be assigned to a cluster, which can be problematic.
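
A minimal sketch of the "try many values of $k$ and score each" loop; here I use silhouette scores as a simple stand-in for the $f(K)$ statistic described in the article, with the $n//3 + 1$ upper bound:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def pick_k(X):
    """Choose k by scoring k-means clusterings for k = 2 .. n//3 + 1."""
    n = X.shape[0]
    best_k, best_score = 2, -1.0
    for k in range(2, n // 3 + 2):
        labels = KMeans(n_clusters=k, n_init=10).fit_predict(X)
        score = silhouette_score(X, labels)
        if score > best_score:
            best_k, best_score = k, score
    return best_k
```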

Evaluating these approaches

To evaluate these approaches, I needed some ground truth events. Fortunately, there is a resource for this: Wikinews.

On Wikinews, contributors write entries about news events and supply a list of sources for each. XML dumps of these pages are available - so I put together a program called focusgroup which parses these dumps, applies a few heuristics to weed out noise events - those in other languages, with too few sources, with sources which are no longer available, with unreliable sources, etc. - and then downloads and extracts the source pages' main content into a database.

Evaluation datasets of 10, 20, and 30 randomly selected clusters/events were built from this data.

What follows are the scores for k-means with automated parameter selection, DBSCAN with automated parameter selection, and my hs_cluster approach (which I'll explain below) on the 10, 20, and 30 event datasets (which I'll call 10E, 20E, and 30E, respectively).

For k-means and DBSCAN, documents are represented using TF-IDF weighted vectors (see the vectorizer source for details).

(Note that k-means is not included for the 30 event dataset because automated parameter selection seemed to run indefinitely.)

| Dataset | Approach | # Clusters | Completeness | Homogeneity | Time |
|---|---|---|---|---|---|
| 10E | kmeans_cluster | 7 | 0.9335443486 | 0.7530130785 | 6.65s |
| 10E | dbscan_cluster | 6 | 1.0 | 0.7485938828 | 0.34s |
| 10E | hs_cluster | 12 | 0.9287775491 | 0.9527683893 | 2.79s |
| 20E | kmeans_cluster | 16 | 0.9684967878 | 0.8743004102 | 12.08s |
| 20E | dbscan_cluster | 16 | 0.9580321015 | 0.8637897487 | 0.52s |
| 20E | hs_cluster | 35 | 0.8677638864 | 0.9888333740 | 4.42s |
| 30E | kmeans_cluster | n/a | n/a | n/a | n/a |
| 30E | dbscan_cluster | 7 | 0.9306925670 | 0.2568053596 | 7.13s |
| 30E | hs_cluster | 60 | 0.8510288789 | 0.9539905161 | 11.27s |

None of the approaches reliably get close to the number of true clusters. But what's important is that the hs_cluster approach scores much better on homogeneity (at the expense of completeness, but again, that's ok).

The hs_cluster approach

These automatic approaches find decent values for these parameters - the scores aren't terrible - but they are clearly still weak on the homogeneity front.

An approach that doesn't require specification of a parameter like $k$ or $\epsilon$ would be ideal.

I've developed an approach which I call hs_cluster (the reason for the name is left as an exercise for the reader) which has been effective so far. Like all approaches, it needs some concept of document similarity (or distance, as is the case with k-means and DBSCAN).

Document similarity

First, the pairwise similarity between all documents is computed.

Here I'm trying to take advantage of some features of the news - in particular, the assumption that articles that belong to the same event will typically talk about the same named entities.

The simplest similarity computation for a pair of documents is just the size of entity intersection (I'll go into some improvements on this further on).
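
In code, that baseline similarity is just set intersection; `entities` here is a hypothetical helper that extracts a set of named-entity strings from a document:

```python
from itertools import combinations

def entity_overlap(doc_a, doc_b):
    """Count the named entities two documents share."""
    return len(entities(doc_a) & entities(doc_b))  # `entities` is assumed, not defined here

def pairwise_similarities(docs):
    """Compute the similarity for every pair of documents."""
    sims = {}
    for i, j in combinations(range(len(docs)), 2):
        sims[(i, j)] = entity_overlap(docs[i], docs[j])
    return sims
```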

Detecting outliers

Given these pairwise similarities, how do we know when two documents are significantly similar? This could be framed as a new parameter - i.e. some threshold entity intersection, above which we can say two documents are sufficiently alike.

However, this just introduces a new parameter selection problem, when the whole point of this new approach is to avoid doing exactly that!

Instead, statistical outliers seem promising. For each document, we identify which of its similarities with other documents are upper outliers, and consider those significant.

There are a few techniques for detecting outliers.

Perhaps the simplest involves using quartiles of the data: look at values above $Q_3 + 2(Q_3 - Q_1)$. The main problem with this technique is that it's sensitive to the sample size, e.g. if you are looking at a handful of documents, you're unlikely to get reliable results.

Another technique is similar to the $\epsilon$ selection for DBSCAN, where abnormally large jumps between values are identified, and outliers are considered values above that jump.

A third technique is based on median-absolute-deviation (MAD), outlined here. This technique has the benefit of not being sensitive to sample size.

In practice, the simple quartile approach has worked best. I haven't looked at document sets of less than 10 in size, so perhaps below that it falls apart.
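
The quartile rule is straightforward to implement; a sketch:

```python
import numpy as np

def upper_outliers(values):
    """Return indices of values above Q3 + 2*(Q3 - Q1)."""
    q1, q3 = np.percentile(values, [25, 75])
    cutoff = q3 + 2 * (q3 - q1)
    return [i for i, v in enumerate(values) if v > cutoff]
```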

Creating the graph

These outlier similarities are then used to construct a graph: documents are nodes, and each outlier similarity defines a weighted edge between the two documents involved.
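
A sketch of the graph construction with networkx; `sims` and `is_outlier` are hypothetical stand-ins for the similarity scores and the outlier test described above:

```python
import networkx as nx

def build_graph(n_docs, sims, is_outlier):
    """Documents are nodes; significantly similar pairs get weighted edges."""
    G = nx.Graph()
    G.add_nodes_from(range(n_docs))
    for (i, j), sim in sims.items():
        if is_outlier(i, j, sim):       # the outlier test from the previous section
            G.add_edge(i, j, weight=sim)
    return G
```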

Identifying and disambiguating cliques

Once the graph is formed, it's just a matter of running a clique finding algorithm (the networkx library includes one) on it.

There is one last issue - an article may belong to multiple cliques, so we need to disambiguate them. The method is simple: stay in the clique for which membership is the strongest (as quantified by the sum of edge weights) and leave all the others.
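
A sketch of the clique-finding and disambiguation step using networkx (not the project's exact implementation):

```python
import networkx as nx

def clique_clusters(G):
    """Find maximal cliques, then keep each node only in its strongest clique."""
    cliques = [set(c) for c in nx.find_cliques(G)]
    assignment = {}
    for node in G.nodes():
        candidates = [c for c in cliques if node in c]
        def strength(clique):
            # Total edge weight from this node to the rest of the clique.
            return sum(G[node][other]['weight'] for other in clique if other != node)
        assignment[node] = max(candidates, key=strength)
    # Group nodes by the clique they stayed in.
    clusters = {}
    for node, clique in assignment.items():
        clusters.setdefault(frozenset(clique), set()).add(node)
    return list(clusters.values())
```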

Initial results

The initial results of this approach were promising:

| Dataset | Completeness | Homogeneity |
|---|---|---|
| 10E | 0.9287775491 | 0.9527683893 |
| 20E | 0.8347274166 | 0.9724715221 |
| 30E | 0.8491559191 | 0.9624775352 |

And here is the graph for the 10E dataset (all graphs are shown before clique disambiguation):

10E hs_cluster graph

(Node labels are in the format of {cluster id}_{article id}.)

It looks pretty good - some of the true clusters are isolated cliques. But there are still many connections between unrelated articles, so it seems that there's room for improvement.

Weighting intersections

"Global" IDF

Up until now we were only looking at intersection of entities, so all entities are treated equally. However, there are some fairly uninformative entities, like "US" or "Congress" which we expect to see naturally occur in many articles. We want to down-weight these entities since their appearance across two documents is unremarkable. Conversely, we want to up-weight entities which rarely occur, since their presence in an intersection is unlikely and therefore more meaningful.

Normally the approach here would be to incorporate IDF (inverse document frequency) weighting; typically this IDF is calculated from the input documents. However, the corpora I'm working with now (and expect to use in the future) are fairly small, perhaps 30-100 documents at a time, which limits the possibility of any meaningful IDF being generated. Instead I've opted for an IDF model built on an external source of knowledge - about 113,200 New York Times articles. That way we have some "global" understanding of a term's importance in the context of news articles.

So now instead of merely looking at the size of the intersection, it uses the sum of the IDF weights of the intersection's members.

Inverse length weighting

The possible intersection between two documents is affected by the length of the documents, so the IDF-weighted overlaps are then further weighted by the inverse of the average length of the two documents being compared.
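
Putting the two weightings together, the similarity might look something like this sketch, where `idf` is assumed to be a dict built from the external corpus:

```python
def weighted_similarity(ents_a, ents_b, len_a, len_b, idf, default_idf=1.0):
    """IDF-weighted entity overlap, scaled by the inverse average document length."""
    shared = ents_a & ents_b
    idf_sum = sum(idf.get(e, default_idf) for e in shared)  # rare entities count more
    avg_len = (len_a + len_b) / 2
    return idf_sum / avg_len if avg_len else 0.0
```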

Results of weighting

Incorporating these weights seems to help for the most part:

| Dataset | Completeness | Homogeneity |
|---|---|---|
| 10E | 0.9557853457 | 1.0 |
| 20E | 0.8375545102 | 0.9793536415 |
| 30E | 0.8548535867 | 0.9689355683 |

Here is the resulting graph for 10E:

10E hs_cluster graph

It looks a lot cleaner than before!

Incorporating keywords

A grim observation from the data is that entities may not be enough to fully discern different news events, such as these two events in the 10E dataset:

[2] US soldier arrested for rape and four murders in Iraq
[5] Two Iraqi MPs killed in bombing at parliament

And these two events in the 20E dataset:

[3] Al Jazeera airs new Osama Bin Laden tape
[5] Bin Laden's former driver convicted of supporting terrorism

(Event names are taken from their Wikinews page title)

It seemed that looking at keywords might provide the missing information to disambiguate these events. So for each article I applied a few keyword extraction algorithms followed by the same approach that was used for entities: use the length-weighted sum of the IDF-weighted keyword intersections.

The final similarity score is then just the sum of this keyword score and the entity score from before.

The results for 10E are actually hurt a bit by this, but improved slightly for 20E and 30E:

| Dataset | Completeness | Homogeneity |
|---|---|---|
| 10E | 0.9051290086 | 0.9527683893 |
| 20E | 0.8574072716 | 0.9888333740 |
| 30E | 0.8414058524 | 0.9704378548 |

If we look at the graph for 10E, it's clear why the results were worse:

10E hs_cluster graph

More noise (i.e. unwanted edges) has been introduced, as you might expect.

For now, I've left the keyword intersection in, but it may require some additional processing to eliminate noisy terms.

Next steps

hs_cluster can overshoot the true number of events by quite a bit, but this is because, for now, it is still very conservative in how it connects articles together. Finding so many extra clusters (i.e. lower completeness) indicates missing edges (this is clear in the graph for the 20E dataset, below), so edge formation needs more investigation: is it a problem of outlier detection? Is it a problem of computing document similarity?

20E hs_cluster graph

It's also unclear how hs_cluster will perform on larger document sets. The 30E dataset includes 103 documents and takes 11-12 seconds, and clique finding grows exponentially with graph size in the worst case. I have not looked deeply into clique-finding algorithms, but a quick look shows that the Bron-Kerbosch algorithm is an efficient way of identifying cliques.

Overall, the results are encouraging. There's still room for improvement, but it seems like a promising direction.

Source

Source code for the project is available at https://github.com/frnsys/hscluster. The IDF models are not included there; if you are interested, please get in touch - I'm on Twitter as @frnsys.


Pasture

Image source, modified and licensed under CC BY-SA 3.0.

The class I've been teaching this fall at the New School, News Automata, didn't require any previous programming experience, so I wanted to find a way to teach my students some basics. I had a plan where I'd get them set up and rolling with the basics and then transition into a journalism-related program involving some text processing libraries.

But when it came time to teach the class, almost the entire two hours were spent setting up the students' development environments. We were working in a school computer lab so students didn't have access to the permissions they might need to install packages and what not. I figured out a decent workaround, but even then, if you aren't already familiar with this stuff, environment configuration is super tedious and frustrating. It's often that way even if you are already familiar with this stuff.

Later, we tried again: this time students brought in their own computers. But of course, everyone had different systems - OSX, Windows, ChromeOS...and I realized that all the tools which make env setup easier (package managers and so on) require their own setup, which just complicated things even further. And the packages I wanted the students to use involved some lower-level libraries, such as numpy and scipy, which required compiling and could take ages depending on the student's hardware. By the time most of the students had their environments set up, everyone was exhausted and dispirited. What a mess.

So to make this all a bit easier, I put together a system which allows students to share a remote development environment. All I had to do was setup the environment once on my own server, and then students could submit code to it through a web interface. As long as the student had a browser, they could write their scripts. It was a huge, huge time saver - everyone could dive right into playing with some code.

I cleaned up the project and turned it into a package to reuse: Pasture. Now you can install it like you would any pypi package and build something on top of it.

I've included an example from the class I was teaching. I had students try their hand at writing some simple newsfeed filtering algorithms. Building on top of Pasture, students could submit a script which included a filtering function and see a newsfeed filtered in real(-ish) time:


Check out the GitHub repo for usage instructions and the full example.


Nomadic

I'm a big supporter of decentralized internet services, but I used to use Evernote because it was really convenient. That's always the problem, isn't it? Some things are just so convenient ¯\_(ツ)_/¯. A few months ago I began using Linux a lot more, and when I discovered that there was neither a Linux Evernote client nor any plans to release one, it seemed like a good opportunity to make the transition to another service.

But I didn't really like any of the other available services, and most of them were still centralized anyway, so I ended up making my own: Nomadic.

It doesn't pack nearly as many features as Evernote does, but that's ok because I think Evernote has too many features anyways. Nomadic can still do quite a lot:

  • GitHub-flavored Markdown
  • Some extra markdown features: highlighting, PDF embedding
  • Syntax highlighting
  • MathJax support
  • Automatically updates image references if they are moved
  • Full-text search (html, txt, markdown, and even PDFs)
  • A rich-text editor - I use it for copying stuff from webpages, though it's not the best at that. It does, however, convert pasted web content into Markdown.
  • A browsable site for your notes (Markdown notes are presented as HTML)
  • A complete command line interface
  • Notes can be exported as web presentations

For someone familiar with setting up web services, Nomadic is pretty easy to install. I'll refer you to the readme for instructions. For anyone else, it's kind of tricky. It would be great to make it non-dev friendly, and I might work on this later if people start asking for it.

Basically, all you do is manage a directory of notes, in Markdown or plaintext. The Nomadic daemon, nomadic-d, indexes new and changed notes, updates resource references if notes move, and runs a lightweight web server for you to browse and search through your notes. That's it.

If you want remote access, you can host Nomadic on a server of your own and put some authentication on it. Then you can SSH in to edit notes and what not. Down the line this kind of support would ideally be built into Nomadic and simplified for less technical users.

Alternatively, you can do what I do - use BitTorrent Sync, which is an application built on top of the BitTorrent protocol for decentralized "cloud" storage. So I sync my Nomadic notes folder across all my devices and can edit them locally on each. On my Android phone I use JotterPad to view and edit notes.

There are a bunch of tweaks and improvements that can be made, but I've been using it for the past four months or so and have felt no need to return to Evernote :)