Argos: Clustering

12.11.2014

In its current state, Argos is an orchestra of many parts:

  • argos - the core project
  • argos.cloud - the infrastructure deployment/configuration/management system
  • argos.corpora - a corpus builder for training and testing
  • argos.cluster, now galaxy - the document clustering package
  • argos.ios and argos.android - the mobile apps

It's an expansive project so there are a lot of random rabbit holes I could go down. But for now I'm just going to focus on the process of developing the clustering system. This is the system which groups articles into events and events into stories, allowing for automatically-generated story timelines. At this point it's probably where most of the development time has been spent.

Getting it wrong: a lesson in trying a lot

When I first started working on Argos, I didn't have much experience in natural language processing (NLP) - I still don't! But I have gained enough to work through some of Argos's main challenges. That has probably been one of the most rewarding parts of the process - at the start some of the NLP papers I read were incomprehensible; now I have a decent grasp on their concepts and how to implement them.

The initial clustering approach was hierarchical agglomerative clustering (HAC) - "agglomerative" because each item starts in its own cluster and clusters are merged sequentially by similarity (the two most similar clusters are merged, then the next two most similar, and so on), and "hierarchical" because the end result is a hierarchy rather than a flat set of clusters.

Intuitively it seemed like a good approach - HAC is agnostic to how similarity is calculated, which left a lot of flexibility in deciding what metric to use (euclidean, cosine, etc) and what features to use (bag-of-words, extracted entities, a combination of the two, etc). The construction of a hierarchy meant that clustering articles into events and clustering events into stories could be accomplished simultaneously - all articles would just be clustered once, and the hierarchy would be snipped at two different levels: once to generate the events, and again at a higher level to generate the stories.
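As a rough sketch of that "cluster once, snip twice" idea - using scipy's off-the-shelf agglomerative clustering rather than Argos's own implementation, with made-up features and thresholds:

from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

articles = [
    "Grand jury declines to indict officer",
    "No indictment for officer in shooting",
    "Protests spread nationwide after the decision",
]

# Bag-of-words (TF-IDF) features; extracted entities could be mixed in as well.
X = TfidfVectorizer().fit_transform(articles).toarray()

# Build the full hierarchy once, using cosine distance and average linkage.
Z = linkage(pdist(X, metric="cosine"), method="average")

# Snip at a tight threshold for events, and again at a looser one for stories.
event_labels = fcluster(Z, t=0.5, criterion="distance")
story_labels = fcluster(Z, t=0.9, criterion="distance")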

Except I completely botched the HAC implementation and didn't realize it for waaay too long. The cluster results sucked and I just thought the approach was inappropriate for this domain. To top it off, I hadn't realized that I could just cluster once, snip twice (as explained above), and I was separately clustering articles into events and events into stories. This slowed things down a ton, but it was already super slow and memory-intensive to begin with.

Meanwhile I focused on developing some of the other functionality, and there was plenty of that to do. I postponed working on the clustering algorithm and told myself I'd just hire an NLP expert to consult on a good approach (i.e. I may never get around to it).

A few months later I finally got around to revisiting the clustering module. I re-read the paper describing HAC and then it became stunningly obvious that my implementation was way off base. I had some time off with my brother and together we wrote a much faster and much simpler implementation in less than an hour.

But even with that small triumph, I realized that HAC had, in this form, a fatal flaw. It generates the hierarchy in one pass and has no way of updating that hierarchy. If a new article came along, I had no choice but to reconstruct the hierarchy from scratch. Imagine if you had a brick building and the only way you could add another brick was by blowing the whole thing up and re-laying each brick again. Clustering would become intolerably slow.

I spent a while researching incremental or online clustering approaches - those which were well-suited to incorporating new data as it became available. In retrospect I should have immediately begun researching this kind of algorithm, but 6 months prior I didn't know enough to consider it.

After some time I had collected a few approaches which seemed promising - including one which is HAC adapted for an incremental setting (IHAC). I ended up hiring a contractor (I'll call him Sol) who had been studying NLP algorithms to help with their implementation (I didn't want to risk another botched implementation). Sol was fantastic and together we were able to try out most of the approaches.

IHAC was the most promising and is the one I ended up going with. It's basically HAC with a modifiable hierarchy. The hierarchy can take a new piece of data and minimally restructure itself to incorporate it.
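The actual restructuring logic is the involved part; here is only a toy sketch of the insertion idea (not Argos's implementation - real IHAC also splits and merges internal nodes to keep the hierarchy sensible, and this toy assumes the root is already an internal node):

import numpy as np

class Node:
    def __init__(self, vec, children=None):
        self.vec = np.asarray(vec, dtype=float)  # a data point (leaf) or a centroid (internal node)
        self.children = children or []

def leaves(node):
    if not node.children:
        return [node]
    return [leaf for child in node.children for leaf in leaves(child)]

def incorporate(root, vec):
    """Attach a new point as a sibling of its nearest existing leaf."""
    vec = np.asarray(vec, dtype=float)
    nearest = min(leaves(root), key=lambda leaf: np.linalg.norm(leaf.vec - vec))
    # Replace the nearest leaf with a small internal node holding both points.
    merged = Node((nearest.vec + vec) / 2, children=[nearest, Node(vec)])
    _swap(root, nearest, merged)

def _swap(node, old, new):
    for i, child in enumerate(node.children):
        if child is old:
            node.children[i] = new
            return True
        if _swap(child, old, new):
            return True
    return False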

I rewrote Sol's implementation (mainly to familiarize myself with it) and started evaluating it on test data, trying to narrow down a set of parameters well-suited to news articles. It was pretty slow, so I tried to parallelize it, but even a second process was enough to run into memory issues. After some profiling and rewriting of key memory bottlenecks, memory usage was reduced by 75-95%! So now certain parts could be parallelized. But it was still quite slow, mainly because it was built using higher-level Python objects and methods.

I ended up rewriting the implementation again, this time moving as much as I could to numpy and scipy, very fast scientific computing Python libraries where a lot of the heavy lifting is done in C. Again, I saw huge improvements - the clustering went something like 12 to 20 times faster!
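To give a flavor of the kind of rewrite involved (illustrative only - these function names are hypothetical, not Argos's actual code), compare a pure-Python pairwise cosine similarity with its vectorized numpy equivalent:

import numpy as np

def pairwise_cosine_slow(vecs):
    # O(n^2) Python-level loops: every dot product and norm goes through the interpreter.
    n = len(vecs)
    sims = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            a, b = vecs[i], vecs[j]
            sims[i][j] = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return sims

def pairwise_cosine_fast(vecs):
    # Normalize all rows at once, then a single matrix multiply in C does the rest.
    X = np.asarray(vecs, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    return np.dot(X, X.T)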

Of course, there were still some speedbumps along the way - bugs here and there, which in the numpy implementation were a bit harder to fix. But now I have a solid implementation which is fast, memory-efficient, persistent (using pytables), and takes full advantage of the algorithm's hierarchical properties (getting events and stories in just two snips).

For the past few days Argos has been in a trial period, humming on a server collecting and clustering articles, and so far it has been doing surprisingly well. The difference between the original implementation and this new one is night and day.

At first Argos was only running on world and politics news, but today I added in some science, tech, and business news sources to see how it will handle those.

It was a long and exhausting journey, but more than anything I'm happy to see that the clustering is working well and quickly!


Argos

12.09.2014

Some of Argos' onboarding screens.

Almost four years ago I got it in my head that it would be fun to try to build a way to automatically detect breaking news. I thought of creating a predictive algorithm based on deviations of word usage across the internet - if usage of the word "pizza" suddenly deviated from its historical mean, something was going on with pizza! Looking back, it was a really silly idea, but it got me interested in working programmatically with natural language and eventually (somehow) morphed into Argos, the news automation service I started working on a little over a year ago. Since I recently got Argos to a fairly well-functioning tech demo stage, this seems like a good spot to reflect on what's been done so far.

The fundamental technical goal for Argos is to automatically apply structure to a chaotic news environment, to make news easier for computers to process and for people to understand.

When news pundits talk about the changing nature of news in the digital age, they often try to pinpoint the "atomic unit" of news. Stretching this analogy a bit far, Argos tries to break news into its "subatomic particles" and let others assemble them into whatever kind of atom they want. Argos can function as a service providing smaller pieces of news to readers, but also as a platform that other developers can build on.

The long-term vision for Argos is to contribute to what I believe is journalism's best function - to provide a simulacrum of the world beyond our individual experience. There are a lot of things standing in the way of that goal. This initial version of Argos focuses on the two biggest obstacles: information overload and complex stories that span long time periods.

At this point in development, Argos watches news sources for new articles and automatically groups them into events. It's then able to take these events and build stories out of them, presented as timelines. As an example, the grand jury announcing their verdict for Darren Wilson would be one event. Another event would be Darren Wilson's resignation, and another would be the protests which followed the grand jury verdict in Ferguson and across the country. Multiple publications reported on each of these events. A lot of that reporting might be redundant, so by collapsing these articles into one unit, Argos eliminates some noise and redundancy.

Argos picked these events out of a few weeks worth of news stories and automatically compiled an event summary. These screenshots are from the Argos Android test app.

These events would all be grouped into the same story. The ongoing protests around Eric Garner's murder would also be an event but would not necessarily be part of the same story, even though the two are related thematically.

A five-point summary is generated for each event, cited from that event's source articles. Thus the timeline for a story functions as an automatically generated brief on everything that's happened up until the latest event. The main use case here is long-burning stories like Ferguson or the Ukraine conflict which often are difficult to follow if you haven't been following the story from the start.
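As a very rough illustration of extractive summarization - a naive frequency-based sentence scorer, almost certainly cruder than whatever Argos actually does:

import re
from collections import Counter

def summarize(text, n_sentences=5):
    # Score each sentence by the frequency of the words it contains,
    # then return the top sentences in their original order.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    freqs = Counter(re.findall(r"[a-z']+", text.lower()))
    ranked = sorted(sentences,
                    key=lambda s: sum(freqs[w] for w in re.findall(r"[a-z']+", s.lower())),
                    reverse=True)
    top = set(ranked[:n_sentences])
    return [s for s in sentences if s in top]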

Argos can also see what people, places, organizations, and other key terms are being discussed in an event, and instantaneously provide information about these terms to quickly inform or remind readers.

Argos detects concepts discussed in an event and can supplement some information, sourced from Wikipedia.
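A hedged sketch of what concept detection plus Wikipedia lookup could look like (the spaCy and wikipedia libraries and the model name here are illustrative assumptions, not necessarily what Argos uses):

import spacy
import wikipedia

nlp = spacy.load("en_core_web_sm")

def concept_summaries(text):
    # Pull out people, organizations, and places mentioned in the event text.
    doc = nlp(text)
    names = {ent.text for ent in doc.ents if ent.label_ in {"PERSON", "ORG", "GPE"}}
    summaries = {}
    for name in names:
        try:
            # A one-sentence Wikipedia summary to inform or remind readers.
            summaries[name] = wikipedia.summary(name, sentences=1)
        except wikipedia.exceptions.WikipediaException:
            continue
    return summaries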

Finally, Argos calculates a "social (media) importance" score for each event to try and estimate what topics are trending. This is mainly to support a "day-in-brief" function (or "week-in-brief", etc), i.e. the top n most "important" (assuming talked about == important) events of today. Later it would be great to integrate discussions and other social signals happening around an event.
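One plausible, purely illustrative shape for such a score - summing an event's social counts with a time decay so that recent chatter counts for more:

import time

def social_importance(articles, half_life_hours=24.0):
    # articles: dicts with a unix 'published_at' timestamp plus share/comment counts.
    now = time.time()
    score = 0.0
    for a in articles:
        age_hours = (now - a["published_at"]) / 3600.0
        decay = 0.5 ** (age_hours / half_life_hours)
        score += (a.get("shares", 0) + a.get("comments", 0)) * decay
    return score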

I've been testing Argos mainly with world and political news (under the assumption that those would be easier to work with for technical reasons). So far that has been working well, so I recently started trying some different news domains, though it's too early to say how that's working out.

The API is not yet public and I'm not sure when it will be officially released. At the moment I can't devote a whole lot of time to the project (if you're interested in becoming involved, get in touch).
Argos does have an unreleased Android app (and an older version for iOS) which at this point is mainly just a tech demo for small-scale testing. Frankly, I don't know if Argos will work best as a consumer product or some intermediary technology powering some other consumer service.

(Later I'll write a post detailing the development of Argos up until now.)


The Founder: Articles of Incorporation

12.07.2014

I've wanted to make a video game for a while now, and I've finally started working on one in the past few months. It's called The Founder and it is a dystopian business simulator. In this game, you play an entrepreneur with your sights set on becoming the next disruptive innovator, dreaming up a company that will bravely push the world into a new, brighter future.


Some early art for the game - you can play as a pharmatech giant

Dystopian fiction is great. But so many of these works plop the reader in some ambiguously distant future and start things from there. The progression to that point, say, from our own historical moment, is far more interesting to me. It's more startling to see how our current systems, values, and ways of thinking, which intuitively feel like they make sense, gradually move us towards a world none of us would ever have consented to.

Technosolutionism as a perspective has a lot of innate appeal to it - how nice would it be if our ills could be easily solved by a simple engineering feat or a new scientific breakthrough, neatly packaged into some product or service? Of course, such a perspective neglects the fact that, while technology is marvelous and certainly has the potential to solve many hard problems, not all problems can be solved by it, and even for those that it can solve, technology is only ever realized in a social context. Technology takes on the values of the environment in which it was nurtured: if developed under the logic of business interests, it comes with or flourishes as new modes of control. We've already seen this in forms both technological - DRM and deep packet inspection, for instance - and legal - the DMCA, SOPA/PIPA, net neutrality, for instance.

Those examples (I reckon) are universally acknowledged as bad ideas. Things get more interesting when you realize many lauded consumer products and services are just as nefarious, if not more so, for their insidious nature.

As a few examples: Uber forces workers to abide by unfair contracts under the guise of economic independence (among many of its other transgressions), AirBnB destabilizes rent prices and undercuts affordable housing while pumping an ad campaign to convince residents otherwise, and Google, Facebook, Twitter, Comcast, Time Warner, etc. are (of course) influencing government policy through lots and lots of lobbying, ostensibly for our benefit.

The Founder takes this world - our world - starting from a decade or two ago, and plots out a possible, if exaggerated, trajectory. How is the promise of technology captured and expressed when borne out of the businesses' relentlessly growth-oriented, profit-seeking logic? What does progress look like in a world obsessed with growth as measured by sheer economic output rather than metrics of well-being?

Like any good power-fantasy game, you are at the center of it! The Founder puts players into the shoes of someone who has bought into techno-capitalist logic and lets them loose in a fictional universe. The player will be instrumental in the progression to an unsavory future; a progression which will (ideally) feel disturbingly rational under the logic of the game. Winning in The Founder means shaping a world in which you are successful - at the expense of it being a dystopia for almost everyone else. It's not what you set out to do as the Founder, but that is what manifests from the definitions of short-term success within the game. Although The Founder brings it to levels of absurdity, I'm hoping the parallels between the preposterous world of The Founder and our own are clear.


News Automata

12.06.2014

Since mid-September I've been teaching at The New School as part of their new Journalism+Design program. Heather Chaplin, who organizes the program, brought me in to teach a new class I put together called News Automata, which explores the technical underpinnings, ethical concerns, and potential benefits/pitfalls of technology in journalism.

The way I ended up teaching the class deviated a bit from my original plan. For reference, here's the initial syllabus:


News Automata Fall 2014 syllabus

Week 1: Introduction to the Course

Introduction to journalism and how technology is affecting it both as an industry and as a practice. What is the point of journalism? We'll talk about a few of the major definitions of journalism's responsibility.

We'll also talk about the format of the course and what'll be expected.

Some things we'll go over:

  • Bias in journalism? Is "fair and balanced" actually fair?
  • Does journalism work?
  • Varying definitions of journalism: the spotlight, the watchdog, the entertainer, the simulacrum, the influencer, the activator, the business

Week 2-3: Leveraging networks: from consumers => producers

How are individuals becoming empowered to participate in the production of news, whether through whistleblowing or "citizen journalism"? How does news "break" in the age of Twitter? How are journalists collaborating with on-the-ground sources to have direct access to those affected, in real time? How are online communities becoming involved in the investigative process? We'll discuss some of the challenges in this area, such as verifying accuracy and security, and look at how things can go wrong.

Some things we'll go over:

Week 4: Bots and drones: the automated assemblage

Automation is creeping into more and more parts of our lives. Jobs which existed a few decades ago are now obsolete. Jobs which exist now will likely be obsolete in an even shorter amount of time. Will journalists be one of them?

Some things we'll go over:

Week 5: Information overload and context

The information age is a blessing and a curse. What good is all that information if you don't have the time, energy, or attention to make use of it? What approaches are being used to make the news easier to digest and give us the fullest understanding of what's happening in the world? Technology and design are both about getting more with less. How can we get the maximum impact from the smallest amount of information?

Some things we'll go over:

Week 6: Engineering virality and control over networks

Facebook wants you to be happy, BuzzFeed knows you like lists and quizzes, and Upworthy understands how to tease you into clicking a link. They have all been immensely successful. Is that all these companies are about? Or is there something larger at play? What happens when all news is designed just to get us to click into it? Or has that already happened?

Some things we'll go over:

Week 7: The Filter Bubble and the new gatekeepers

The most prominent approach to managing information overload is self, algorithmic, or thought-leader curation. But the nature of these three filtering mechanisms leads many to worry – are we just seeing more of the same? How does that affect how we think about the world? Is that fundamentally antithetical to journalism's aspirations as a practice?

Some things we'll go over:

Week 8: Taking action

Journalism is about more than knowing what's happening. It's also about acting on that knowledge. How can we design systems to tighten the intent-action gap? How can we make it easier for people to organize and act on the issues they feel strongly about?

Some things we'll go over:


Here's what the classes ended up being about:

  • What does journalism hope to accomplish?
  • Social networks and news production: crowdsourcing/citizen journalism, problems of verification, perspectives, popular curation vs the gatekeeper model
  • Automation, news bots, intro to machine learning, intro to natural language processing
  • Communities and discussions online, anonymity, and bits of behavioral economics, game theory, decision theory, group dynamics, sociological/psychological research for understanding online behavior
  • How elections are reported (the week after midterm elections), how algorithmic filters work, filter bubbles/"personalized propaganda"
  • Hands-on: building our own news feeds and algorithmic filters with Python
  • The maturing medium of video games (narrative, mechanics, aesthetic, and technical perspectives) and how it relates to journalism

There are still two more classes in which I plan on covering:

  • The physical basis of the internet (an overview on its infrastructure and the politics of that infrastructure)
  • Taking action and digital impact IRL: slacktivism, hacktivism (Anonymous, Wikileaks), doxxing, DDoS/LOIC, etc

This was my first time teaching a class, so it's been a great learning experience for me, and my students were great. I'm hoping to teach it again next fall. The class was public, which ended up being a boon (I was worried about it at first) - lots of people from all sorts of backgrounds stopped in for a class or two and had interesting contributions to our discussions.


Discovering high-quality comments and discussions

12.05.2014

Say you have a lot of users commenting on your service, and inevitably all the pains of unfettered (pseudo-)anonymous chatter emerge - spam, abusive behavior, trolling, meme circlejerking, etc. How do you sort through all this cruft? You can get community feedback - up and down votes, for example - but then you have to make sense of all of that too.

There are many different approaches for doing so. Is any one the best? I don't think so - it really depends on the particular context and the nuances of what you hope to achieve. There is probably not a single approach generalizable to all social contexts. We won't unearth some grand theory of social physics which could support such a solution; things don't work that way.

Nevertheless! I want to explore a few of those approaches and see if any come out more promising than the others. This post isn't going to be exhaustive but is more of a starting point for ongoing research.

Goals

There are a lot of different goals a comment ranking system can be used for, such as:

  • detection of spam
  • detection of abusive behavior
  • detection of high-quality contributions

Here, we want to detect high-quality contributions and minimize spam and abusive behavior. The three are interrelated, but the focus is detecting high-quality contributions, since we can assume that it encapsulates the former two.

Judge

First of all, what is a "good" comment? Who decides?

Naturally, it really depends on the site. If it's some publication, such as the New York Times, which has in mind a particular kind of atmosphere they want to cultivate (top-down), that decision is theirs.

If, however, we have a more community-driven (bottom-up) site built on some agnostic framework (such as Reddit), then the criteria for "good" is set by the members of that community (ideally).

The size and demographics of the community also play a large part in determining these criteria. As one might expect, a massive site with millions of users has very different needs than a small forum with a handful of tight-knit members.

So what qualifies as good will vary from place to place. Imgur will value different kinds of contributions than the Washington Post. As John Suler, who originally wrote about the Online Disinhibition Effect, puts it:

According to the theory of "cultural relativity," what is considered normal behavior in one culture may not be considered normal in another, and vice versa. A particular type of "deviance" that is despised in one chat community may be a central organizing theme in another.

To complicate matters further, one place may value many different kinds of content. What people are looking for varies day to day, so sometimes something silly is wanted, and sometimes something more cerebral is (BuzzFeed is good at appealing to both these impulses). But very broadly there are two categories of user contributions which may be favored by a particular site:

  1. low-investment: quicker to digest, shorter, easier to produce. Memes, jokes, puns, and their ilk usually fall under this category. Comments of this kind are more likely based on the headline than on the content of the article itself.
  2. high-investment: lengthier, requires more thought to understand, and more work to produce.

For our purposes here, we want to rank comments according to the latter criteria - in general, comments tend towards the former, so creating a culture of that type of user contribution isn't really difficult. It often happens on its own (the "Fluff Principle"[1]) and that's what many services want to avoid! Consistently high-quality, high-investment user contribution is the holy grail.

I should also note that low-investment and high-investment user contributions are both equally capable of being abusive and offensive (e.g. a convoluted racist rant).

Comments vs discussions

I've been talking about high-quality comments but what we really want are high-quality discussions (threads). That's the whole point of a commenting system, isn't it? To get users engaged and discussing whatever the focus of the commenting is (an article, etc). So the relation of a comment to its subsequent or surrounding discussion will be an important criteria.

Our definition of high-quality

With all of this in mind, our definition of high-quality will be:

A user contribution is considered high quality if it is not abusive, is on-topic, and incites or contributes to civil, constructive discussion which integrates multiple perspectives.

Other considerations

Herd mentality & the snowball effect

We want to ensure that any comments which meet our criteria have an equal opportunity of being seen. We don't want a snowball effect[2] where the highest-ranked comment always has the highest visibility and thus continues to attract the most votes. Some kind of churn is necessary for keeping things fresh, but also so that no single perspective dominates the discussion. We want to discourage herd mentality and other polarizing triggers[3].

Gaming the system

A major challenge in designing such a system is preventing anyone from gaming it. That is, we want to minimize the effects of Goodhart's Law:

When a measure becomes a target, it ceases to be a good measure.

This includes automated attacks (e.g. bots promoting certain content) or corporate-orchestrated manipulation such as astroturfing (where a company tries to artificially generate a "grassroots" movement around a product or policy).

Minimal complexity

On the user end of things, we want to keep things relatively simple. We don't want to have to implement complicated user feedback mechanisms for the sake of gathering more data solely for better ranking comments. We should leverage existing features of the comment (detailed below) where possible.

Distinguish comment ranking from ranking the user

Below I've included "user features" but we should take care that we judge the comment, not the person. Any evaluation system needs to recognize that some people have bad days, which doesn't necessarily reflect on their general behavior[4]. And given enough signals that a particular behavior is not valued by the community, users can adjust their behavior accordingly - we want to keep that option open.

No concentration of influence

Expanding on this last point, we don't want our ranking system to enable an oligarchy of powerusers, which has a tendency of happening in some online communities. This can draw arbitrary lines between users; we want to minimize social hierarchies which may work to inhibit discussion.

Transparency

Ideally whatever approach is adopted is intuitive enough that any user of the platform can easily understand how their submissions are evaluated. Obscurity is too often used as a tool of control.

Adaptable and flexible

Finally, we should be conscious of historical hubris and recognize that it's unlikely we will "get it right" because getting it right is dependent on a particular cultural and historical context which will shift and evolve as the community itself changes. So whatever solution we implement, it should be flexible.

Our goals

In summary, what we hope to accomplish here is the development of some computational approach which:

  • requires minimal user input
  • minimizes snowballing/herd mentality effects
  • is difficult to game by bots, astroturfers, or other malicious users
  • promotes high-quality comments and discussions (per our definition)
  • penalizes spam
  • penalizes abusive and toxic behavior
  • is easy to explain and intuitive
  • maintains equal opportunity amongst all commenters (minimizes poweruser bias)
  • is adaptable to changes in community values

What we're working with

Depending on how your commenting system is structured, you may have the following data to work with:

  • Comment features:
    • noisy user sentiment (simple up/down voting)
    • clear user sentiment (e.g. Slashdot's more explicit voting: "insightful", for instance.)
    • comment length
    • comment content (the actual words that make up the comment)
    • posting time
    • the number of edits
    • time interval between votes
  • Thread features:
    • length of the thread's branches (and also the longest, min, avg, etc branch length)
    • number of branches in the thread
    • Moderation activity (previous bans, deletions, etc)
    • Total number of replies
    • Length of time between posts
    • Average length of comments in the thread
  • User features:
    • aggregate historical data of this user's past activity
    • some value of reputation

Pretty much any commenting system includes a feature where users can express their sentiment about a particular comment. Some are really simple (and noisy) - basic up/down voting, for instance. It's really hard to tell what exactly a user means by an up or a downvote - myriad possible reactions are reduced to a single surface form. Furthermore, the low cost of submitting a vote means votes will be used more liberally, which is perhaps the intended effect. High-cost unary or binary voting systems, such as Reddit Gold (which requires purchase before use), have a scarcity which communicates the severity of the feedback a bit more clearly. The advantage of low-cost voting is that it's more democratic - anyone can do it. High-cost voting introduces barriers which may bar certain users from participating.

Other feedback mechanisms are more sophisticated and more explicit about the particular sentiment the user means to communicate. Slashdot allows randomly-selected moderators to specify whether a comment was insightful, funny, flamebait, etc, where each option has a different voting value.

I've included user features here, but as mentioned before, we want to rate the comment and discussion while trying to be agnostic about the user. So while you could have a reputation system of some kind, it's better to see how far you can go without one. Any user is wholly capable of producing good and bad content, and we care only about judging the content.


Approaches

Caveat: the algorithms presented here are just sketches meant to convey the general idea behind an approach.

Basic vote interpretation

The simplest approach is to just manipulate the explicit user feedback for a comment:

score = upvotes - downvotes

But this approach has a few issues:

  • scores are biased towards post time. The earlier someone posts, the greater visibility they have, therefore the greater potential they have for attracting votes. These posts stay at the top, thus maintaining their greater visibility and securing that position (snowballing effect).
  • it may lead to domination and reinforcement of the most popular opinion, drowning out any alternative perspectives.

In general, simple interpretations of votes don't really provide an accurate picture.

A simple improvement here would be to take into account post time. We can penalize older posts a bit to try and compensate for this effect.

score = (upvotes - downvotes) - age_of_post

Reddit used this kind of approach but ended up biasing post time too much, such that those who commented earlier typically dominated the comments section. Or you could imagine someone who posts at odd hours - by the time others are awake to vote on it, the comment is already buried because of the time penalty.
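A gentler variant of a time penalty - dividing by a power of the post's age rather than subtracting it, along the lines of Hacker News's story-ranking gravity - decays scores gradually instead of burying them outright (illustrative; Hacker News applies this to stories using points minus one, and the parameters vary):

def time_decayed_score(upvotes, downvotes, age_hours, gravity=1.8):
    # Older posts are penalized smoothly: doubling a post's age costs far less
    # than a flat subtraction would.
    return (upvotes - downvotes) / ((age_hours + 2) ** gravity)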

Reddit later replaced the time-penalty approach with a more sophisticated interpretation of votes, taking them as a statistical sample rather than a literal value and using that to estimate a more accurate ranking of the comment (this is Reddit's "best" sorting):

If everyone got a chance to see a comment and vote on [a comment], it would get some proportion of upvotes to downvotes. This algorithm treats the vote count as a statistical sampling of a hypothetical full vote by everyone, much as in an opinion poll. It uses this to calculate the 95% confidence score for the comment. That is, it gives the comment a provisional ranking that it is 95% sure it will get to. The more votes, the closer the 95% confidence score gets to the actual score.

If a comment has one upvote and zero downvotes, it has a 100% upvote rate, but since there's not very much data, the system will keep it near the bottom. But if it has 10 upvotes and only 1 downvote, the system might have enough confidence to place it above something with 40 upvotes and 20 downvotes — figuring that by the time it's also gotten 40 upvotes, it's almost certain it will have fewer than 20 downvotes. And the best part is that if it's wrong (which it is 5% of the time), it will quickly get more data, since the comment with less data is near the top — and when it gets that data, it will quickly correct the comment's position. The bottom line is that this system means good comments will jump quickly to the top and stay there, and bad comments will hover near the bottom.

Here is a pure Python implementation, courtesy of Amir Salihefendic:

from math import sqrt

def _confidence(ups, downs):
    n = ups + downs

    if n == 0:
        return 0

    z = 1.0 #1.0 = 85%, 1.6 = 95%
    phat = float(ups) / n
    # Lower bound of the Wilson score confidence interval for the upvote proportion.
    return (phat + z*z/(2*n) - z*sqrt((phat*(1-phat) + z*z/(4*n))/n)) / (1 + z*z/n)

def confidence(ups, downs):
    if ups + downs == 0:
        return 0
    else:
        return _confidence(ups, downs)
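
Plugging in the example from the quoted explanation (using the z = 1.0 default above), the 10-up/1-down comment does indeed score higher than the 40-up/20-down one:

confidence(10, 1)   # ≈ 0.79
confidence(40, 20)  # ≈ 0.60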

Reddit's Pyrex implementation is available here.

The benefit here is you no longer need to take post time into account. When you get the vote count doesn't matter; all that matters is the sample size!

Manipulating the value of votes

How you interpret votes depends on how you collect and value them. Different interaction designs for voting systems, even if the end result is just an up or downvote, can influence the value users place on a vote and the conditions under which they submit one.

For instance, Quora makes only upvoters visible and hides downvoters. This is a big tangent so I will just leave it at that for now.

In terms of valuing votes, it's been suggested that votes should not be equal - that they should be weighted according to the reputation of the voter:

  • Voting influence is not the same for all users: it's not 1 (+1 or -1) for everyone but somewhere in the range 0-1.
  • When a user votes for a content item, they also vote for the creator (or submitter) of the content.
  • The voting influence of a user is calculated using the positive and negative votes they have received for their submissions.
  • Exemplary users always have a static maximum influence.
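
A hedged sketch of the scheme described above (the influence function and the neutral default for new users are assumptions, not a reference implementation):

def voting_influence(received_up, received_down, exemplary=False):
    # A voter's influence lies in [0, 1], derived from the votes their own
    # submissions have received; exemplary users are pinned at the maximum.
    if exemplary:
        return 1.0
    total = received_up + received_down
    if total == 0:
        return 0.5  # neutral default for users with no history (an assumption)
    return received_up / total

def weighted_score(votes):
    # votes: iterable of (direction, influence) pairs, where direction is +1 or -1.
    return sum(direction * influence for direction, influence in votes)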

Like any system where power begets power, though, there's potential for a positive feedback loop: highly-voted users amass all the voting power and you end up with a poweruser oligarchy.

Using features of the comment

For instance, we could use the simple heuristic: longer comments are better.

Depending on the site, there is some correlation between comment length and comment quality. This appears to be the case with Metafilter and Reddit (source) - or at least we can say more discussion-driven subreddits have longer comments.

Using the structure of the thread

Features of the thread itself can provide a lot of useful information.
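For instance, a few of the thread features listed earlier (branch count, longest branch, total replies) are cheap to compute from the comment tree - a rough sketch with a stand-in Comment class:

class Comment:
    def __init__(self, text, replies=None):
        self.text = text
        self.replies = replies or []

def depth(comment):
    # Length of the longest branch rooted at this comment.
    if not comment.replies:
        return 1
    return 1 + max(depth(r) for r in comment.replies)

def size(comment):
    # Total number of comments in this subtree.
    return 1 + sum(size(r) for r in comment.replies)

def thread_features(top_level):
    # top_level: the comments made directly on the article.
    return {
        "n_branches": len(top_level),
        "longest_branch": max(depth(c) for c in top_level),
        "total_replies": sum(size(c) for c in top_level) - len(top_level),
    }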

Srikanth Narayan's tldr project visualizes some heuristics we can use for identifying certain types of threads.


Conclusion

Here I've only discussed a few approaches for evaluating the quality of comments and using that information to rank and sort them. But there are other techniques which can be used in support of the goals set out at the start of this post. For example, posting frequency or time-interval limits could be set to discourage rapid (knee-jerk?) commenting, the usage of real identities could be made mandatory (ala Facebook - not a very palatable solution), or posting privileges could be scaled (ala Stack Overflow) or limited to only certain users (ala Metafilter). Gawker's Kinja allows the thread parent to "dismiss" any child comments. You could provide explicit structure for responses, as the New York Times has experimented with. Disqus creates a global reputation across all its sites, but preserves some frontend pseudonymity (you can appear with different identities, but the backend ties them all together).

Spamming is best handled by shadowbanning, where users don't know that they're banned and have the impression that they are interacting with the site normally. Vote fuzzing is a technique used by Reddit where actual vote counts are obscured so that voting bots have difficulty verifying their votes or whether or not they have been shadowbanned.

What I've discussed here are technical approaches, which alone cannot solve many of the issues which plague online communities. The hard task of changing people's attitudes is pretty crucial too.

As Joshua Topolsky of the Verge ponders:

Maybe the way to encourage intelligent, engaging and important conversation is as simple as creating a world where we actually value the things that make intelligent, engaging and important conversation. You know, such as education, manners and an appreciation for empathy. Things we used to value that seem to be in increasingly short supply.


[1]: The Fluff Principle, as described by Paul Graham:

on a user-voted news site, the links that are easiest to judge will take over unless you take specific measures to prevent it. (source)

User LinuxFreeOrDie also runs through a theory on how user content sites tend towards low-investment content. The gist is that low-investment material is quicker to digest and more accessible, thus more people will vote on it more quickly, so in terms of sheer volume, that content is most likely to get the highest votes. A positive feedback effect happens where the low-investment material subsequently becomes the most visible, therefore attracting an even greater share of votes.

[2]: Users are pretty strongly influenced by the existing judgement on a comment, which can lead to a snowballing effect (at least in the positive direction):

At least when it comes to comments on news sites, the crowd is more herdlike than wise. Comments that received fake positive votes from the researchers were 32% more likely to receive more positive votes compared with a control...And those comments were no more likely than the control to be down-voted by the next viewer to see them. By the end of the study, positively manipulated comments got an overall boost of about 25%. However, the same did not hold true for negative manipulation. The ratings of comments that got a fake down vote were usually negated by an up vote by the next user to see them. (source)

[3]: Researchers at George Mason University Center for Climate Change Communication found that negative comments set the tone for a discussion:

The researchers were trying to find out what effect exposure to such rudeness had on public perceptions of nanotech risks. They found that it wasn't a good one. Rather, it polarized the audience: Those who already thought nanorisks were low tended to become more sure of themselves when exposed to name-calling, while those who thought nanorisks are high were more likely to move in their own favored direction. In other words, it appeared that pushing people's emotional buttons, through derogatory comments, made them double down on their preexisting beliefs. (source)

[4]: Riot Games's player behavior team found that toxic behavior is typically sporadically distributed amongst normally well-behaving users:

if you think most online abuse is hurled by a small group of maladapted trolls, you're wrong. Riot found that persistently negative players were only responsible for roughly 13 percent of the game's bad behavior. The other 87 percent was coming from players whose presence, most of the time, seemed to be generally inoffensive or even positive. These gamers were lashing out only occasionally, in isolated incidents—but their outbursts often snowballed through the community. (source)
