Nomadic

12.14.2014 21:45

I’m a big supporter of decentralized internet services, but I used to use Evernote because it was really convenient. That’s always the problem, isn’t it? Some things are just so convenient ¯\_(ツ)_/¯. A few months ago I began using Linux a lot more, and when I discovered that there was neither a Linux Evernote client nor any plans to release one, it seemed like a good opportunity to make the transition to another service.

But I didn’t really like any of the other available services, and most of them were still centralized, so I ended up making my own: Nomadic.

It doesn’t pack nearly as many features as Evernote does, but that’s ok because I think Evernote has too many features anyways. Nomadic can still do quite a lot:

  • GitHub-flavored Markdown
  • Some extra Markdown features: highlighting, PDF embedding
  • Syntax highlighting
  • MathJax support
  • Automatically updates image references if they are moved
  • Full-text search (HTML, txt, Markdown, and even PDFs)
  • A rich-text editor - I use it for copying stuff from webpages, though it’s not the best at that. It does, however, convert pasted web content into Markdown.
  • A browsable site for your notes (Markdown notes are presented as HTML)
  • A complete command line interface
  • Notes can be exported as web presentations

For someone familiar with setting up web services, Nomadic is pretty easy to install. I’ll refer you to the readme for instructions. For anyone else, it’s kind of tricky. It would be great to make it non-dev friendly, and I might work on this later if people start asking for it.

Basically, all you do is manage a directory of notes, in Markdown or plaintext. The Nomadic daemon, nomadic-d, indexes new and changed notes, updates resource references if notes move, and runs a lightweight web server for you to browse and search through your notes. That’s it.
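This isn’t Nomadic’s actual code, but the daemon’s core watch-and-index loop is simple enough to sketch with the watchdog library (index_note and update_references here are hypothetical stand-ins for the real work):

```python
import os
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

NOTES_DIR = os.path.expanduser('~/notes')  # wherever your notes directory lives

def index_note(path):
    print('indexing', path)          # stub: would update the full-text search index

def update_references(src, dest):
    print('moved', src, '->', dest)  # stub: would rewrite links pointing at the old path

class NoteHandler(FileSystemEventHandler):
    def on_created(self, event):
        if not event.is_directory:
            index_note(event.src_path)

    def on_modified(self, event):
        if not event.is_directory:
            index_note(event.src_path)

    def on_moved(self, event):
        if not event.is_directory:
            update_references(event.src_path, event.dest_path)

if __name__ == '__main__':
    observer = Observer()
    observer.schedule(NoteHandler(), NOTES_DIR, recursive=True)
    observer.start()  # a web server for browsing/search would run alongside this
    observer.join()
```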

If you want remote access, you can host Nomadic on a server of your own and put some authentication on it. Then you can SSH in to edit notes and whatnot. Down the line this kind of support would ideally be built into Nomadic and simplified for less technical users.

Alternatively, you can do what I do - use BitTorrent Sync, which is an application built on top of the BitTorrent protocol for decentralized “cloud” storage. So I sync my Nomadic notes folder across all my devices and can edit them locally on each. On my Android phone I use JotterPad to view and edit notes.

There are a bunch of tweaks and improvements that can be made, but I’ve been using it for the past four months or so and have felt no need to return to Evernote :)


Argos: Clustering

12.11.2014 03:30

In its current state, Argos is an orchestra of many parts:

  • argos - the core project
  • argos.cloud - the infrastructure deployment/configuration/management system
  • argos.corpora - a corpus builder for training and testing
  • argos.cluster, now galaxy - the document clustering package
  • argos.ios and argos.android - the mobile apps

It’s an expansive project, so there are a lot of random rabbit holes I could go down. But for now I’m just going to focus on the process of developing the clustering system. This is the system which groups articles into events and events into stories, allowing for automatically generated story timelines. At this point it’s probably where most of the development time has been spent.

Getting it wrong: a lesson in trying a lot

When I first started working on Argos, I didn’t have much experience in natural language processing (NLP) - I still don’t! But I have gained enough to work through some of Argos’s main challenges. That has probably been one of the most rewarding parts of the process - at the start some of the NLP papers I read were incomprehensible; now I have a decent grasp on their concepts and how to implement them.

The initial clustering approach was hierarchical agglomerative clustering (HAC) - “agglomerative” because each item starts in its own cluster and clusters are merged sequentially by similarity (the two most similar clusters are merged, then the next two most similar, etc.), and “hierarchical” because the end result is a hierarchy as opposed to explicit clusters.

Intuitively it seemed like a good approach - HAC is agnostic to how similarity is calculated, which left a lot of flexibility in deciding what metric to use (euclidean, cosine, etc) and what features to use (bag-of-words, extracted entities, a combination of the two, etc). The construction of a hierarchy meant that clustering articles into events and clustering events into stories could be accomplished simultaneously - all articles would just be clustered once, and the hierarchy would be snipped at two different levels: once to generate the events, and again at a higher level to generate the stories.
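Here’s a rough sketch of the cluster-once-snip-twice idea using scipy’s off-the-shelf HAC (not Argos’s code; the feature vectors and thresholds are made up):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# toy stand-ins for article feature vectors (e.g. bag-of-words or entity counts)
articles = np.random.rand(20, 50)

# build the full hierarchy once
Z = linkage(articles, method='average', metric='cosine')

# snip low for tight clusters (events), higher for looser ones (stories);
# the distance thresholds here are arbitrary and would need tuning
events = fcluster(Z, t=0.3, criterion='distance')
stories = fcluster(Z, t=0.7, criterion='distance')
```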

Except I completely botched the HAC implementation and didn’t realize it for waaay too long. The clustering results sucked, and I just thought the approach was inappropriate for this domain. To top it off, I hadn’t realized that I could just cluster once and snip twice (as explained above), so I was separately clustering articles into events and events into stories. This slowed things down a ton, and it was already super slow and memory-intensive to begin with.

Meanwhile I focused on developing some of the other functionality, and there was plenty of that to do. I postponed working on the clustering algorithm and told myself I’d just hire an NLP expert to consult on a good approach (i.e. I may never get around to it).

A few months later I finally got around to revisiting the clustering module. I re-read the paper describing HAC and then it became stunningly obvious that my implementation was way off base. I had some time off with my brother and together we wrote a much faster and much simpler implementation in less than an hour.

But even with that small triumph, I realized that HAC had, in this form, a fatal flaw. It generates the hierarchy in one pass and has no way of updating that hierarchy. If a new article came along, I had no choice but to reconstruct the hierarchy from scratch. Imagine if you had a brick building and the only way you could add another brick was by blowing the whole thing up and relaying each brick again. Clustering would become intolerably slow.

I spent a while researching incremental or online clustering approaches - those which are well-suited to incorporating new data as it becomes available. In retrospect I should have immediately begun researching this kind of algorithm, but 6 months prior I didn’t know enough to consider it.

After some time I had collected a few approaches which seemed promising - including one which is HAC adapted for an incremental setting (IHAC). I ended up hiring a contractor (I’ll call him Sol) who had been studying NLP algorithms to help with their implementation (I didn’t want to risk another botched implementation). Sol was fantastic, and together we were able to try out most of the approaches.

IHAC was the most promising and is the one I ended up going with. It’s basically HAC with a modifiable hierarchy. The hierarchy can take a new piece of data and minimally restructure itself to incorporate it.
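To give a flavor of what “minimally restructure” means, here’s a toy sketch (not the actual algorithm from the IHAC paper): a new point descends to its nearest leaf, a new internal node is spliced in above it, and only the centroids along that path get touched.

```python
import numpy as np

class Node:
    """A node in the hierarchy: a leaf vector, or an internal node with a centroid."""
    def __init__(self, vec, children=None):
        self.vec = np.asarray(vec, dtype=float)
        self.children = children or []

def insert(root, vec):
    """Splice a new leaf in next to its nearest neighbor instead of
    re-clustering everything (real IHAC also restructures and rebalances)."""
    vec = np.asarray(vec, dtype=float)
    node, path = root, []
    while node.children:
        path.append(node)
        node = min(node.children, key=lambda c: np.linalg.norm(c.vec - vec))
    merged = Node((node.vec + vec) / 2, children=[node, Node(vec)])
    if not path:                     # the root itself was still a leaf
        return merged
    parent = path[-1]
    parent.children[parent.children.index(node)] = merged
    for ancestor in reversed(path):  # nudge centroids along the search path
        ancestor.vec = np.mean([c.vec for c in ancestor.children], axis=0)
    return root

# grow a hierarchy one article vector at a time
root = Node([0.0, 0.0])
for v in [[1.0, 0.1], [0.1, 1.0], [0.9, 0.2]]:
    root = insert(root, v)
```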

I rewrote Sol’s implementation (mainly to familiarize myself with it) and started evaluating it on test data, trying to narrow down a set of parameters well-suited to news articles. It was pretty slow, so I tried to parallelize it, but just a second process was enough to run into memory issues. After some profiling and rewriting of key memory bottlenecks, memory usage was reduced by 75-95%! So now certain parts could be parallelized. But it was still quite slow, mainly because it was built using higher-level Python objects and methods.

I ended up rewriting the implementation again, this time moving as much as I could to numpy and scipy - fast scientific computing libraries for Python in which a lot of the heavy lifting is done in C. Again, I saw huge improvements - the clustering went something like 12 to 20 times faster!
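The gist of that kind of rewrite, as an illustrative before/after rather than the actual Argos code - replace per-pair Python loops with one vectorized call that does its looping in C:

```python
import numpy as np
from scipy.spatial.distance import cdist

vecs = np.random.rand(1000, 100)

# slow: pairwise euclidean distances with Python-level loops
def pairwise_slow(X):
    n = len(X)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            D[i, j] = np.sqrt(((X[i] - X[j]) ** 2).sum())
    return D

# fast: the same result in one call, with the loops happening in C
D = cdist(vecs, vecs)  # euclidean by default
```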

Of course, there were still some speed bumps along the way - bugs here and there, which in the numpy implementation were a bit harder to fix. But now I have a solid implementation which is fast, memory-efficient, persistent (using PyTables), and takes full advantage of the algorithm’s hierarchical properties (getting events and stories in just two snips).

For the past few days Argos has been in a trial period, humming on a server collecting and clustering articles, and so far it has been doing surprisingly well. The difference between the original implementation and this new one is night and day.

At first Argos was only running on world and politics news, but today I added in some science, tech, and business news sources to see how it will handle those.

It was a long and exhausting journey, but more than anything I’m happy to see that the clustering is working well and quickly!


Argos

12.09.2014 20:42

Some of Argos’ onboarding screens.

Almost four years ago I got it in my head that it would be fun to try to build a way to automatically detect breaking news. I thought of creating a predictive algorithm based on deviations in word usage across the internet - if usage of the word “pizza” suddenly deviated from its historical mean, something was going on with pizza! Looking back, it was a really silly idea, but it got me interested in working programmatically with natural language and eventually (somehow) morphed into Argos, the news automation service I started working on a little over a year ago. Since I recently got Argos to a fairly well-functioning tech demo stage, this seems like a good spot to reflect on what’s been done so far.
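For what it’s worth, the naive version of that idea is just z-score burst detection - flag a word when today’s usage strays several standard deviations from its historical mean. A quick sketch, with made-up counts:

```python
import numpy as np

history = np.array([120, 95, 110, 130, 105, 98, 115])  # made-up daily counts of "pizza"
today = 640

z = (today - history.mean()) / history.std()
if z > 3.0:  # arbitrary threshold
    print("something is going on with pizza (z = %.1f)" % z)
```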

The fundamental technical goal for Argos is to automatically apply structure to a chaotic news environment, to make news easier for computers to process and for people to understand.

When news pundits talk about the changing nature of news in the digital age, they often try to pinpoint the “atomic unit” of news. Stretching this analogy a bit far, Argos tries to break news into its “subatomic particles” and let others assemble them into whatever kind of atom they want. Argos can function as a service providing smaller pieces of news to readers, but also as a platform that other developers can build on.

The long-term vision for Argos is to contribute to what I believe is journalism’s best function - to provide a simulacrum of the world beyond our individual experience. There are a lot of things standing in the way of that goal. This initial version of Argos focuses on the two biggest obstacles: information overload and complex stories that span long time periods.

At this point in development, Argos watches news sources for new articles and automatically groups them into events. It’s then able to take these events and build stories out of them, presented as timelines. As an example, the grand jury announcing its decision in the Darren Wilson case would be one event. Another event would be Darren Wilson’s resignation, and another would be the protests which followed the grand jury decision in Ferguson and across the country. Multiple publications reported on each of these events, and a lot of that reporting might be redundant, so by collapsing these articles into one unit, Argos eliminates some noise and redundancy.

Argos picked these events out of a few weeks’ worth of news stories and automatically compiled an event summary. These screenshots are from the Argos Android test app.

These events would all be grouped into the same story. The ongoing protests around Eric Garner’s murder would also be an event but would not necessarily be part of the same story, even though the two are related thematically.

A five-point summary is generated for each event, cited from that event’s source articles. Thus the timeline for a story functions as an automatically generated brief on everything that’s happened up until the latest event. The main use case here is long-burning stories like Ferguson or the Ukraine conflict which often are difficult to follow if you haven’t been following the story from the start.
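I won’t detail how the summaries are generated here, but the simplest extractive version of the idea is easy to sketch - score each sentence by the frequency of its words and keep the top five (a naive stand-in, not Argos’s actual method):

```python
import re
from collections import Counter

def summarize(text, n=5):
    """Naive extractive summary: keep the n sentences with the most frequent words."""
    sentences = re.split(r'(?<=[.!?])\s+', text)
    freqs = Counter(re.findall(r'\w+', text.lower()))
    ranked = sorted(sentences, reverse=True,
                    key=lambda s: sum(freqs[w] for w in re.findall(r'\w+', s.lower())))
    top = set(ranked[:n])
    return [s for s in sentences if s in top]  # preserve original order
```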

Argos can also see what people, places, organizations, and other key terms are being discussed in an event, and instantaneously provide information about these terms to quickly inform or remind readers.

Argos detects concepts discussed in an event and can supplement some information, sourced from Wikipedia.
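The standard off-the-shelf way to pull out those people, places, and organizations is named entity recognition. Just as an illustration (not necessarily what Argos uses), NLTK’s chunker:

```python
import nltk  # assumes NLTK's tokenizer, tagger, and chunker data is installed via nltk.download()

sentence = "Darren Wilson resigned after the grand jury decision in Ferguson."
tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sentence)))

# collect the labeled entity spans (PERSON, GPE, ORGANIZATION, ...)
entities = [(' '.join(word for word, tag in subtree.leaves()), subtree.label())
            for subtree in tree.subtrees() if subtree.label() != 'S']
print(entities)  # something like [('Darren Wilson', 'PERSON'), ('Ferguson', 'GPE')]
```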

Finally, Argos calculates a “social (media) importance” score for each event to try and estimate what topics are trending. This is mainly to support a “day-in-brief” function (or “week-in-brief”, etc), i.e. the top n most “important” (assuming talked about == important) events of today. Later it would be great to integrate discussions and other social signals happening around an event.
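I won’t pin down the scoring details, but the shape of the “day-in-brief” computation is simple: some weighted sum of social counts, then take the top n. The weights and field names below are entirely made up:

```python
import heapq

def social_score(event):
    # hypothetical weights over hypothetical social count fields
    return 1.0 * event['shares'] + 0.5 * event['comments'] + 0.1 * event['likes']

def day_in_brief(events, n=5):
    """The top n events of the day, assuming talked about == important."""
    return heapq.nlargest(n, events, key=social_score)
```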

I’ve been testing Argos mainly with world and political news (under the assumption that those would be easier to work with for technical reasons). So far that has been working well, so I recently started trying some different news domains, though it’s too early to say how that’s working out.

The API is not yet public and I’m not sure when it will be officially released. At the moment I can’t devote a whole lot of time to the project (if you’re interested in becoming involved, get in touch). Argos does have an unreleased Android app (and an older version for iOS) which at this point is mainly just a tech demo for small-scale testing. Frankly, I don’t know if Argos will work best as a consumer product or as some intermediary technology powering some other consumer service.

(Later I’ll write a post detailing the development of Argos up until now.)


The Founder: Articles of Incorporation

12.07.2014 12:22

I’ve wanted to make a video game for a while now, and in the past few months I’ve finally started working on one. It’s called The Founder, and it is a dystopian business simulator. In this game, you play an entrepreneur with your sights set on becoming the next disruptive innovator, dreaming up a company that will bravely push the world into a new, brighter future.


Some early art for the game - you can play as a pharmatech giant

Dystopian fiction is great. But so many of these works plop the reader in some ambiguously distant future and start things from there. The progression to that point, say, from our own historical moment, is far more interesting to me. It’s more startling to see how our current systems, values, and ways of thinking - which intuitively feel like they make sense - gradually move us towards a world none of us would ever have consented to.

Technosolutionism as a perspective has a lot of innate appeal - how nice would it be if our ills could be easily solved by a simple engineering feat or a new scientific breakthrough, neatly packaged into some product or service? Of course, such a perspective neglects the fact that, while technology is marvelous and certainly has the potential to solve many hard problems, not all problems can be solved by it, and even for those that can be, technology is only ever realized in a social context. Technology takes on the values of the environment in which it was nurtured: if developed under the logic of business interests, it comes with or flourishes as new modes of control. We’ve already seen this in forms both technological - DRM and deep packet inspection, for instance - and legal - the DMCA, SOPA/PIPA, and attacks on net neutrality, for instance.

Those examples (I reckon) are universally acknowledged as bad ideas. Things get more interesting when you realize many lauded consumer products and services are just as nefarious, if not more so for their insidious nature.

As a few examples: Uber forces workers to abide by unfair contracts under the guise of economic independence (among its many other transgressions), AirBnB destabilizes rent prices and undercuts affordable housing while pumping out an ad campaign to convince residents otherwise, and Google, Facebook, Twitter, Comcast, Time Warner, etc. are (of course) influencing government policy through lots and lots of lobbying, ostensibly for our benefit.

The Founder takes this world - our world - starting from a decade or two ago, and plots out a possible, if exaggerated, trajectory. How is the promise of technology captured and expressed when borne out of business’s relentlessly growth-oriented, profit-seeking logic? What does progress look like in a world obsessed with growth as measured by sheer economic output rather than by metrics of well-being?

Like any good power-fantasy game, you are at the center of it! The Founder puts players into the shoes of someone who has bought into techno-capitalist logic and lets them loose in a fictional universe. The player will be instrumental in the progression to an unsavory future; a progression which will (ideally) feel disturbingly rational under the logic of the game. Winning in The Founder means shaping a world in which you are successful - at the expense of it being a dystopia for almost everyone else. It’s not what you set out to do as the Founder, but that is what manifests from the definitions of short-term success within the game. Although The Founder brings it to levels of absurdity, I’m hoping the parallels between the preposterous world of The Founder and our own are clear.


News Automata

12.06.2014 00:46

Since mid-September I’ve been teaching at The New School as part of their new Journalism+Design program. Heather Chaplin, who organizes the program, brought me in to teach a new class I put together called News Automata, which explores the technical underpinnings, ethical concerns, and potential benefits/pitfalls of technology in journalism.

The way I ended up teaching the class deviated a bit from my original plan. For reference, here’s the initial syllabus:


News Automata Fall 2014 syllabus

Week 1: Introduction to the Course

Introduction to journalism and how technology is affecting it both as an industry and as a practice. What is the point of journalism? We’ll talk about a few of the major definitions of journalism’s responsibility.

We’ll also talk about the format of the course and what’ll be expected.

Some things we’ll go over:

  • Bias in journalism? Is “fair and balanced” actually fair?
  • Does journalism work?
  • Varying definitions of journalism: the spotlight, the watchdog, the entertainer, the simulacrum, the influencer, the activator, the business

Week 2-3: Leveraging networks: from consumers => producers

How are individuals becoming empowered to participate in the production of news, whether through whistleblowing or “citizen journalism”? How does news “break” in the age of Twitter? How are journalists collaborating with on-the-ground sources to have direct access to those affected, in real time? How are online communities becoming involved in the investigative process? We’ll discuss some of the challenges in this area, such as verifying accuracy and security, and look at how things can go wrong.


Week 4: Bots and drones: the automated assemblage

Automation is creeping into more and more parts of our lives. Jobs which existed a few decades ago are now obsolete. Jobs which exist now will likely be obsolete in an even shorter amount of time. Will journalists be next?


Week 5: Information overload and context

The information age is a blessing and a curse. What good is all that information if you don’t have the time, energy, or attention to make use of it? What approaches are being used to make the news easier to digest and give us the fullest understanding of what’s happening in the world? Technology and design are both about getting more with less. How can we get the maximum impact from the smallest amount of information?


Week 6: Engineering virality and control over networks

Facebook wants you to be happy, BuzzFeed knows you like lists and quizzes, and Upworthy understands how to tease you into clicking a link. They have all been immensely successful. Is that all these companies are about? Or is there something larger at play? What happens when all news is designed just to get us to click into it? Or has that already happened?


Week 7: The Filter Bubble and the new gatekeepers

The most prominent approach to managing information overload is self, algorithmic, or thought-leader curation. But the nature of these three filtering mechanisms leads many to worry – are we just seeing more of the same? How does that affect how we think about the world? Is that fundamentally antithetical to journalism’s aspirations as a practice?


Week 8: Taking action

Journalism is about more than knowing what’s happening. It’s also about acting on that knowledge. How can we design systems to narrow the gap between intention and action? How can we make it easier for people to organize and act on the issues they feel strongly about?



Here’s what the classes ended up being about:

  • What does journalism hope to accomplish?
  • Social networks and news production: crowdsourcing/citizen journalism, problems of verification, perspectives, popular curation vs the gatekeeper model
  • Automation, news bots, intro to machine learning, intro to natural language processing
  • Communities and discussions online, anonymity, and bits of behavioral economics, game theory, decision theory, group dynamics, sociological/psychological research for understanding online behavior
  • How elections are reported (the week after midterm elections), how algorithmic filters work, filter bubbles/“personalized propaganda”
  • Hands-on: building our own news feeds and algorithmic filters with Python (a sketch of this sort of exercise follows this list)
  • The maturing medium of video games (narrative, mechanics, aesthetic, and technical perspectives) and how it relates to journalism
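Here’s a sketch of the sort of feed-filtering exercise from the hands-on session (not the actual class code) - pull a feed with the feedparser library and rank entries by a crude keyword score:

```python
import feedparser  # pip install feedparser

# hypothetical user interests to filter by
INTERESTS = {'election', 'senate', 'ferguson', 'protest'}

def score(entry):
    """Count how many interest keywords appear in the headline."""
    words = set(entry.get('title', '').lower().split())
    return len(words & INTERESTS)

# any RSS feed works here; this URL is just an example
feed = feedparser.parse('http://rss.cnn.com/rss/cnn_topstories.rss')
for entry in sorted(feed.entries, key=score, reverse=True)[:10]:
    print(score(entry), entry.get('title', ''))
```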

There are still two more classes in which I plan on covering:

  • The physical basis of the internet (an overview on its infrastructure and the politics of that infrastructure)
  • Taking action and digital impact IRL: slacktivism, hacktivism (Anonymous, Wikileaks), doxxing, DDoS/LOIC, etc

This was my first time teaching a class, so it’s been a great learning experience for me, and my students were wonderful. I’m hoping to teach it again next fall. The class was public, which ended up being a boon (I was worried about it at first) - lots of people from all sorts of backgrounds stopped in for a class or two and had interesting contributions to our discussions.
