House of Cards

07.17.2015

Bots coast on the credentials of the real users of the computers they hijack. Bots were observed to click more often (but not improbably more often) than real people. Sophisticated bots moved the mouse, making sure to move the cursor over ads. Bots put items in shopping carts and visited many sites to generate histories and cookies to appear more demographically appealing to advertisers and publishers. - The Bot Baseline: Fraud in Digital Advertising. December 2014. White Ops, Inc. and the Association of National Advertisers.

The study cited above looked at 5.5 billion ad impressions over 60 days across 36 organizations. 23 percent of all observed video impressions were bots (p16).

Like too many things in our world, most useful internet services must justify themselves financially. The business model of much of the internet is based on advertising, and that whole schema is based on the assumption that advertising increases consumption. The further assumption there is that the subjects viewing the advertisements both can be influenced by ads and have the capacity to make purchases - that is, that they're human.

These behaviors are easily simulated, up to the point of purchase - although perhaps it makes economic sense for a bot to occasionally make a purchase in order to maintain the broader illusion of human engagement. There's something here about the dissolving of the internet human subject but I can't quite put my finger on it.

These bots are deployed by fraudsters to increase revenue, but since the purpose of automation is to outsource things we'd rather not do ourselves, maybe we regular folks could have bots simulate our own advertising engagement.

It's worth flipping through the report, if only for this title:

VIEWABILITY DOES NOT ENSURE HUMANITY


Similarity Metrics for Short Texts

07.17.2015

Computing similarity metrics for short texts can be very difficult. It's the main challenge in developing Geiger. The problem is that text similarity metrics typically rely on exact overlap of terms (called "surface matching" because you match surface forms1 of words), and short texts are sparse in their terms. There is more opportunity for overlap in longer documents by virtue of the fact that there are simply more words.

Say you have two news articles about employment. If these are longer documents, they may mention words like "work" or "jobs" or "employment". Similarity metrics reliant on common terms will work fine here.

Now say you have two comments about employment, both of which are fairly short, say ~400 characters each.

Here are two examples:

C1: This attack on another base of employment by democrats is why what began in the last elections will be finished by 2016. I can't wait to participate.

C2: Nobody thinks the fossil fuel industry opposition to clean air has anything to do with jobs. Establish new companies in these areas, and retrain miners to do those jobs. Or put them to work fixing our roads and bridges, as in FDR's day. Americans should be outraged that this is even an issue for the courts. What a waste! It's the air, stupid.

One comment mentions "employment" explicitly; the other only talks about "jobs" and "work". They are both talking about the same thing, but there is no exact overlap of these key terms, so common-term similarity metrics will fail to recognize that. This is referred to as a problem of synonyms.
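
To make the failure concrete, here is a minimal sketch of a common-term metric - Jaccard similarity over word sets - applied to the two comments above (the helper names are mine, and the comment texts are abridged):

    # A minimal sketch of a common-term ("surface matching") similarity metric:
    # Jaccard similarity over sets of word tokens.
    import re

    def tokens(text):
        """Lowercase a text and split it into a set of word tokens."""
        return set(re.findall(r"[a-z']+", text.lower()))

    def jaccard(a, b):
        """Shared terms divided by total distinct terms, in [0, 1]."""
        ta, tb = tokens(a), tokens(b)
        return len(ta & tb) / len(ta | tb)

    c1 = "This attack on another base of employment by democrats ..."
    c2 = "Nobody thinks the fossil fuel industry opposition to clean air has anything to do with jobs ..."

    print(jaccard(c1, c2))  # low score: "employment" and "jobs"/"work" share no surface forms

The two comments come out as having little or no overlap, even though they are about the same topic.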

The converse of the synonym problem is that of polysemy - where one term can mean different things in different contexts. The typical example is "bank", which can mean a financial institution, or the side of a river, or can be used as a verb. Maybe a comment says "good job" and really isn't saying anything at all about employment, but common-term similarity metrics won't recognize that.

There's quite a bit of literature on this problem (see below for a short list) - the popularity of Twitter as a dataset has spurred a lot of interest here. Most of the approaches turn to some external source of knowledge - variously referred to as "world knowledge", "background knowledge", "auxiliary data", "additional semantics", "external semantics", and perhaps by other names as well.

This external knowledge is usually another corpus of longer texts related to the short texts. Often this is Wikipedia or some subset of Wikipedia pages, but it could be something more domain-specific as well. You can use this knowledge to relate terms by co-occurrence, e.g. maybe you see that "job", "work", and "employment" occur together often in the Wikipedia page for "Employment", so you know that the terms have some relation.
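
As a rough sketch of that idea - assuming a collection of background documents (say, the texts of some relevant Wikipedia pages) is already in hand; fetching and cleaning them is a separate problem - document-level co-occurrence counts might be computed like this:

    # A sketch of relating terms by co-occurrence in an external corpus.
    # `background_docs` is assumed to be a list of longer texts, e.g. Wikipedia pages.
    from collections import defaultdict
    from itertools import combinations
    import re

    def tokenize(text):
        return set(re.findall(r"[a-z]+", text.lower()))

    def cooccurrence_counts(background_docs):
        """Count, for each pair of terms, how many documents they appear in together."""
        counts = defaultdict(int)
        for doc in background_docs:
            terms = tokenize(doc)
            # fine for a sketch; a real implementation would restrict this
            # to a vocabulary of interest rather than all term pairs
            for a, b in combinations(sorted(terms), 2):
                counts[(a, b)] += 1
        return counts

    def relatedness(counts, a, b):
        """Co-occurrence count for a pair of terms (0 if never seen together)."""
        a, b = sorted((a, b))
        return counts[(a, b)]

With an employment-related corpus, terms like "job", "work", and "employment" should end up with nonzero relatedness even though they never overlap on the surface.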

Alternatively, with Wikipedia (or other corpora with some explicit structure) you could look at the pagelink or redirect graph for this information. Terms that redirect to one another could be considered synonymous, path length on the pagelink graph could be interpreted as a similarity degree between two terms. Wikipedia's disambiguation pages can help with the polysemy problem, leaning on term co-occurrence in the disambiguated pages (e.g. this comment contains "bank" and "finance"; only one of the Wikipedia pages for "Bank" also has both those terms).
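
Here's a sketch of the graph-based variant using networkx; the tiny redirects mapping and pagelinks edge list below are placeholders for structure you would extract from an actual Wikipedia dump:

    # A sketch of graph-based relatedness over a Wikipedia-like link structure.
    # `redirects` and `pagelinks` are toy stand-ins for data from a real dump.
    import networkx as nx

    redirects = {"jobs": "employment"}  # redirect targets treated as synonyms
    pagelinks = [("employment", "labour_economics"),
                 ("labour_economics", "unemployment")]

    graph = nx.Graph(pagelinks)

    def resolve(term):
        """Follow a redirect, if any, so synonymous titles map to the same node."""
        return redirects.get(term, term)

    def similarity(a, b):
        """Inverse path length on the pagelink graph as a crude similarity degree."""
        a, b = resolve(a), resolve(b)
        if a == b:
            return 1.0
        try:
            return 1.0 / (1 + nx.shortest_path_length(graph, a, b))
        except (nx.NetworkXNoPath, nx.NodeNotFound):
            return 0.0

    print(similarity("jobs", "unemployment"))  # nonzero, via the redirect and a short path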

I am still in the process of trying these methods, but they have some intuitive appeal. When we make sense of short documents - or any text, for that matter - we always rely on background knowledge orders of magnitude greater than the text we are looking at. It's sensible to try and emulate that when working with machine text processing as well.


Side note: You can also approximately resolve synonyms using word embeddings (vector representations of single terms), such as those derived from a Word2Vec model. I say "approximately" because word embeddings capture not how similar terms are in meaning, but how "swappable" they are. That is, two word embeddings are similar if one term can take the place of the other.

For example, consider the sentences "Climate change will be devastating" and "Global warming will be devastating". The terms aren't technically synonymous, but are often used as such, so we'll say that they practically are. A well-trained Word2Vec model will pick up that they often appear in similar contexts, so they will be considered similar.

But also consider "You did a good job" and "You did a bad job". Here the term "good" and "bad" appear in similar contexts, and so the Word2Vec model would call them similar. But we would not call them synonymous.
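
For illustration, here is roughly what that looks like with gensim's Word2Vec. The corpus below is a toy one, so the actual numbers will be noisy - in practice you'd train on a large background corpus - but it shows the mechanism (parameter names follow recent gensim versions):

    # A sketch of approximate synonym resolution with word embeddings (gensim Word2Vec).
    from gensim.models import Word2Vec

    sentences = [
        ["climate", "change", "will", "be", "devastating"],
        ["global", "warming", "will", "be", "devastating"],
        ["you", "did", "a", "good", "job"],
        ["you", "did", "a", "bad", "job"],
    ]

    model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=50)

    # Terms used in similar contexts get similar vectors - which catches
    # near-synonyms like "climate change"/"global warming", but also
    # opposites like "good"/"bad", since both are "swappable" in context.
    print(model.wv.similarity("climate", "global"))
    print(model.wv.similarity("good", "bad"))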


Referenced papers:

  • Yih, W. and Meek, C. Improving Similarity Measures for Short Segments of Text. 2007.
  • Hu, X., Sun, N., Zhang, C., and Chua, T. Exploiting Internal and External Semantics for the Clustering of Short Texts Using World Knowledge. 2009.
  • Petersen, H. and Poon, J. Enhancing Short Text Clustering with Small External Repositories. 2011.
  • Hu, X., Zhang, X., Lu, C., Park, E., and Zhou, X. Exploiting Wikipedia as External Knowledge for Document Clustering. 2009.
  • Jin, O., Liu, N., Zhao, K., Yu, Y., and Yang, Q. Transferring Topical Knowledge from Auxiliary Long Texts for Short Text Clustering. 2011.
  • Seifzadeh, S., Farahat, A., and Kamel, M. Short-Text Clustering using Statistical Semantics. 2015.

  1. A surface form of a word is its form as it appears in the text. For example, "run", "ran", and "running" are all surface forms of the lemma "run".


Simulating the swarm

07.16.2015

For some time, I've been enamored with an idea that Tim Hwang once told me about creating bots for the purpose of testing a social network, rather than for purely malicious or financial reasons. Say you're developing a new social networking platform and you've thought ahead far enough to implement some automated moderation features. How do you know they'll work as expected?

Imagine you could simulate how users are expected to behave - spawn hundreds to thousands to millions of bots with different personality profiles, modeled from observed behaviours on actual social networks, and let them loose on the network. Off the top of my head, you'd need to model things at the platform/network and agent (user) level.

At the platform level, you'd want to model:

  • When a new user joins
  • How users are related on the social graph, in terms of influence. With this, you can model how ideas or behaviours spread through the network.

At the agent (user) level:

  • When a user leaves the network
  • When a user sends a message
    • What that message contains (no need for detail here, perhaps you can represent it as a number in [-1, 1], where -1=toxic and 1=positive)
    • Who that message is sent to
  • What affects user engagement

In terms of user profiles, some parameters (which can change over time) might be:

  • Level of toxicity/aggressiveness
  • Base verbosity/engagement (how often they send messages)
  • Base influence (how influential they are in general)
  • Interest vectors - say there are some n topics in the world and users can have some discrete position on each (e.g. some users feel "1" towards a topic, others feel "2", still others feel "3"). You could use these interest vectors to model group dynamics. (A rough sketch of such an agent profile follows this list.)
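
Here's a rough sketch of what such an agent profile might look like. The names, ranges, and the little message model are all just illustrative assumptions:

    # A minimal sketch of an agent (user) profile; all parameters are illustrative.
    import random
    from dataclasses import dataclass, field

    N_TOPICS = 5  # assumed number of topics in the world

    @dataclass
    class Agent:
        toxicity: float   # 0 (civil) to 1 (toxic)
        verbosity: float  # probability of sending a message on a given tick
        influence: float  # general influence weight
        interests: list = field(default_factory=lambda: [random.randint(1, 3) for _ in range(N_TOPICS)])
        enjoyment: float = 1.0  # drops when the agent is harassed
        active: bool = True     # False once the agent leaves the network

        def maybe_message(self):
            """With probability `verbosity`, emit a message valence in [-1, 1]."""
            if random.random() < self.verbosity:
                # toxicity 0 skews toward +1 (positive), toxicity 1 toward -1 (toxic)
                return max(-1.0, min(1.0, random.gauss(1 - 2 * self.toxicity, 0.3)))
            return None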

Finally, you need to develop some metrics to quantify how well your platform does. Perhaps users have some "enjoyment" value which goes down when they are harassed, and you can look at the mean enjoyment across the network. Or you could track how often users start leaving the network. Another interesting thing to look at would be the structure of the social graph. Are there high levels of interaction between groups with distant interest vectors (that is, are people from different backgrounds and interests co-mingling)? Or are all the groups relatively isolated from one another?
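
Continuing the hypothetical Agent sketch above, those metrics might be tracked with something like this (the harassment penalty, leave threshold, and random message routing are arbitrary placeholders):

    # A sketch of platform-level metrics: mean enjoyment and churn.
    def step(agents, harassment_penalty=0.1, leave_threshold=0.2):
        """One tick: agents message each other, toxic messages dock enjoyment, unhappy agents leave."""
        active = [a for a in agents if a.active]
        for sender in active:
            valence = sender.maybe_message()
            if valence is None:
                continue
            receiver = random.choice(active)  # crude stand-in for a real social graph
            if valence < 0:
                receiver.enjoyment = max(0.0, receiver.enjoyment + valence * harassment_penalty)
        for a in active:
            if a.enjoyment < leave_threshold:
                a.active = False

    def metrics(agents):
        """Mean enjoyment of the remaining users, and the fraction who have left."""
        active = [a for a in agents if a.active]
        mean_enjoyment = sum(a.enjoyment for a in active) / len(active) if active else 0.0
        churn = 1 - len(active) / len(agents)
        return mean_enjoyment, churn

    agents = [Agent(toxicity=random.random(), verbosity=0.5, influence=random.random())
              for _ in range(1000)]
    for _ in range(100):
        step(agents)
    print(metrics(agents))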

You'd also have to incorporate the idiosyncrasies of each particular network. For instance, is there banning or moderation? You could add these as attributes on individual nodes (i.e. is_moderator=True|False). This can get quite complex if modeling features like subreddits, where moderator abilities exist only in certain contexts. Direct messaging poses a problem as well, since by its nature that data is unavailable for modeling behaviour. Reddit also has up and down voting, which affects the visibility of contributions, whereas Twitter does not work this way.

Despite these complications, it may be enough to create some prototypical users and provide a simple base model of their interaction. Developers of other networks can tailor these to whatever features are particular to their platform.

With the recently released Reddit dataset, consisting of almost 1.7 billion comments, building prototypical user models may be within reach. You could obtain data from other sites (such as New York Times comments) to build additional prototypical users.

This would likely be a crude approximation at best, but what model isn't? As it is said, models can be useful nonetheless.

These are just some initial thoughts, but I'd like to give this idea more consideration and see how feasible it is to construct.


Designing Digital Communities @ SRCCON

07.01.2015

What's the worst online community you could imagine? That's what we asked participants at SRCCON this year, in our session Designing Digital Communities for News.

The responses were enlightening - we learned a great deal about how people's concerns varied and which concerns were common. In addition to concerns about trolling and safety, many of the responses were focused on what content was seen and which community members were able to contribute that content. Here are some of our findings:

  • Moderation came up in almost every group, and many groups included both no moderation and too much or unpredictable moderation in their worst communities. Moderation without rationale – e.g. getting banned without being told why – also came up.
  • Expertise was a concern – the idea that users could inflate their own expertise on an issue and that it's easy for false information to propagate.
  • Several groups mentioned dominant personalities, i.e. communities in which the loudest voices have the most traction and quiet, meaningful voices are overpowered.
  • Contribution quality ranking came up several times, in terms of arbitrary ranking systems (e.g. comments with the most words go to the top) or lack of any quality heuristics at all.
  • Lack of threading came up several times; it was the most common UI concern across the groups. One group mentioned lack of accessibility as a problem.
  • Ambiguity came up as a problem in many different ways: lack of explicit and visible community values/guidelines, an unclear sense of purpose or focus, and even concerns about intellectual property – who owns the contributions?
  • Some groups brought up the collection of arbitrary and excessive personal information as problematic. Similarly, unnecessarily restrictive or permissive terms of services/licenses were also a concern ("perpetual license to do anything [with your data]"), as was the unnecessary display of personal information to others.
  • Complete anonymity (e.g. no user accounts or history) was mentioned a few times as something to be avoided. Only one group mentioned forced IRL identities (using your SSN as a username).
  • No control over your own contributions came up a couple times - not being able to edit, delete, or export your own submissions.
  • The worst communities would be powered by slow software, the groups said — all the better to ratchet up feelings of frustration.
  • A few communities are notorious as examples of a "bad community" - Reddit, YouTube, and LinkedIn cropped up a few times.

from Calvin & Hobbes, by Bill Watterson

After the dust settled, attendees then designed their ideal online community. As you might expect, some of the points here were just the opposite of what came up in the worst community exercise. But interestingly, several new concerns arose. Here are some of the results:

  • Clarity of purpose and of values – some explicitly mentioned a "code of conduct".
  • Accessibility across devices, internet connections, and abilities.
  • There was a focus on ease of onboarding. One group pointed to this GitHub guide as an example, which provides a non-technical and accessible introduction to a technical product.
  • A couple of groups mentioned providing good tools that users could utilize for self-moderation and conflict resolution.
  • Transparency, in terms of public votes and moderation logs.
  • Increasing responsibility/functionality as a user demonstrates quality contributions.
  • Still a sense of small-community intimacy at scale.
  • A space where respectful disagreement and discussion is possible.
  • Options to block/mute people, and share these block lists.
  • Graduated penalties – time blocks, then bans, and so on.
  • Some kind of thematic unity.

Some findings from our other research were echoed here, which is validating for how we might focus our efforts.

It was interesting to see that there was a lot of consensus around what constituted a bad online community, but a wider range of opinions around what a good community could look like. We definitely have no shortage of starting places and many possible directions. We're looking forward to building some of these ideas out in the next few months!

By Francis Tseng & Tara Adiseshan for the Coral Project.


Threat Modeling in Digital Communities

04.08.2015

It's trolls all the way down

As part of my OpenNews fellowship I've recently started working on the Coral Project, a Knight-funded joint venture between the New York Times, the Washington Post, and Mozilla which, broadly, is focused on improving the experience of digital community online. That mission is a catch-all for lots and lots of subproblems; the set I'm particularly drawn to are the issues around creating inclusive and civil spaces for discussion.

Any attempt at this must contend with a variety of problems which undermine and degrade online communities. To make the problem more explicit, it's helpful to have a taxonomy of these "threats". I'll try to avoid speculating on solutions and save that for another post.

Trolls/flamers/cyberbullying

The most visible barriers to discussion spaces are deliberately toxic actors - generally lumped together under the term "trolls"*.

I think most people are familiar with what a troll is, but for the sake of completeness: trolls are the users who go out and deliberately attack, harass, or offend individuals or groups of people.

If you're interested in hearing more about what might motivate a troll, this piece provides some insight.

Astroturfing

Any mass conglomeration of spending power or social capital soon becomes a resource to be mined by brands. So many companies (and other organizations) have adopted the practice of astroturfing - simulating a grassroots movement.

For instance, a company gets a lot of people to rave about their products until you too, just by sheer exposure (i.e. attrition), adopt a similar attitude as your baseline. This is a much more devious form of spam because it deliberately tries to misshape our perception of reality.

This can increase the amount of noise in the network and reduce the visibility/voice of legitimate members.

Sockpuppeting/Sybil attacks

A common problem in ban-based moderation systems is that barriers-to-entry on the site may be low enough such that malicious actors can create endless new accounts with which to continue their harassment. This type of attack is called a Sybil attack (named after the dissociative identity disorder patient).

Similarly, a user may preemptively create separate accounts to carry out malicious activity, keeping deplorable behavior distinct from their primary account. In this case, the non-primary accounts are sockpuppets.

It seems the problem with Sybil attacks is the ease of account creation, but I don't think the solution is to make barriers-to-entry higher. Rather, you should ask whether banning is the best strategy. Ideally, we should seek to forgive and reform users rather than to exclude them (I'll expand on this in another post). That approach, of course, depends on whether or not the user is actually trying to participate in good faith.

Witch hunts

This is the madness of crowds that can spawn on social networks. An infraction, whether real or not, whether big or small, goes viral to the point that the response is disproportionate by several orders of magnitude. Gamergate, which began last year and now seems to be a permanent part of the background radiation of the internet, is an entire movement that blew up from a perceived - that is, non-existent, and not particularly problematic - offense. In these cases, the target often becomes a symbol for some broader issue, and it's too quickly forgotten that this is a person we're talking about.

Eternal September

"Eternal September" refers to September 1993, when AOL's expansion of access to Usenet caused a large influx of new users who were not socialized to the norms of existing Usenet communities. The event is credited with the decline in quality of those communities, and the term now generally refers to the anxiety of a similar event: new users who know nothing about what a group values, how it communicates, and so on come in and overwhelm the existing members.

Appeals to "Eternal September"-like problems may themselves be a problem - they may be used to rally existing community members in order to suppress a diversifying membership, in which case it's really no different than any other kind of status quo bias.

To me this is more a question of socialization and plasticity - that is, how should new members be integrated into the community and its norms? How does the community smoothly adapt as its membership changes?

Brigading

Brigading is the practice where organized groups seek out targets - individuals, articles, etc. - which criticize their associated ideas, people, and so on, and descend en masse to flood the comments in an incendiary way (or otherwise act harmfully).

This is similar to astroturfing, but I tend to see brigading as being more of a bottom-up movement (i.e. genuinely grassroots and self-organized).

Doxxing

Doxxing - the practice of uncovering and releasing personally identifying information without consent - is by now notorious and is no less terrible than when it first became a thing. Doxxing is made possible by continuity in online identity - the attacker needs to connect one particular account to others, which can be accomplished through linking the same or similar usernames, email addresses, or even personal anecdotes posted across various locations. This is a reason why pseudonyms are so important.

Swatting

Swatting is a social engineering (i.e. manipulative) "prank" in which police are called in to investigate a possible threat where there is none. It isn't new but seems to have had a resurgence in popularity recently. What was once an act of revenge (i.e. you might "swat" someone you didn't like) now seems to be done purely for the spectacle (i.e. without consideration of who the target is, just for lulz) - for instance, someone may get swatted while streaming themselves on Twitch.tv.

The Fluff Principle

The "Fluff Principle" (as it was named by Paul Graham) is where a vote-driven social network eventually comes to be dominated by "low-investment material" (or, in Paul's own words, "the links that are easiest to judge").

The general idea is that if a piece of content takes one second to consume and judge, more people will be able to upvote it in a given amount of time. Thus knee-jerk or image-macro-type content comes to dominate. Long-form essays and other content which takes time to consume, digest, and judge just can't compete.

Over time, the increased visibility of the low-investment material causes it to become the norm, so more of it is submitted, and the site's demographic comes to expect it - and thus goes the positive feedback loop.

Power-user oligarchies

In order to improve the quality of content or user contributions, many sites rely on voting systems or user reputation systems (or both). Often these systems confer greater influence or control features in accordance with social rank, which can spiral into an oligarchy. A small number of powerful users end up controlling the majority of content and discussion on the site.

Gaming the system

Attempts to solve any of the above typically involve creating some kind of technological system (as opposed to a social or cultural one) to muffle undesirable behavior and/or encourage positive contribution.

Especially clever users often find ways of using these systems in ways contradictory to their purpose. We should never underestimate the creativity of users under constrained conditions (in both bad and good ways!).


Whether or not some of these are problems really depends on the community in question. For instance, maybe a site's purpose is to deliver quick-hit content and not cerebral long-form essays. And the exact nature of these problems - their nuances and idiosyncrasies to a particular community - are critical in determining what an appropriate and effective solution might be. Unfortunately, there are no free lunches :\

(Did I miss any?)


* The term "troll" used to have a much more nuanced meaning. "Troll" used to refer to subtle social manipulators, engaging in a kind of aikido in which they caused people to trip on their own words and fall by the force of their own arguments. They were adept at playing dumb to cull out our own inconsistencies, hypocrisies, failures in thinking, or inappropriate emotional reactions. But you don't see that much anymore...just the real brutish, nasty stuff.
