web page hit counter

Sunday, February 14, 2010

Linked Data, Confidence Games and the Transitivity of Trust

Over the Christmas holidays I took my family on a five thousand mile roadtrip around the American West. It took a couple of weeks and I expected to spend a lot of time on my favorite user-generated travel review site.

And I did spend a lot of time on the site, enough to eventually figure out that it had been comprehensively infiltrated by review spammers. Some of the spam reviews were obvious: "I loved this place! Five stars!" when all the rest of the reviews were negative. Some were more devious: "There were bedbugs! They spat in my soup! Zero stars!" when all the other reviews were stellar. In other cases it was much harder to tell, and in all cases the average rating was highly suspect.

Turns out there are companies that specialize in vandalizing review sites[1]. The companies employ actual humans who spend actual creative effort to craft misleading reviews. They even set up realistic user profiles, and on some sites they add each other as friends. In other words, it's considered worthwhile to spend real time and effort on this stuff.

It's been suggested that there's a technological solution: If the reviewers are part of a social network, it's possible to extract some useful statistics that might help determine if the user is real or fake.

If the reviewer is a friend, that's obviously useful information. But there's very little chance that some random reviewer is your friend.

But what if the reviewer is part of your extended social network? Surely the fact that somebody is a friend of a friend is some indication that they're trustworthy, or at least that they're a real person.


First off, with a fan-out of 200 friends the 2nd-level extended social graph is around 40,000 people. Allowing for annoying people who friend everybody, an extended social graph could easily include a substantial portion of the entire population of the planet. All it takes is a couple of mistaken friend-adds to get you hooked up to a spammer-created sub-network. Even if you're careful, it's overwhelmingly likely that some friend-of-a-friend isn't.

So, trust is clearly not transitive and the idea of a "web of trust" cannot be taken literally[2].

In most cases, it's only possible to determine if the "shape" of the reviewer's social graph is reasonable. That is, are they friends with other plausible-looking people? Are many of their friends known fake profiles? Do they have a realistic number of friends? Etc.

But that's trivial to game. Even if there are obstacles to a totally automated approach, the application of ultra-cheap human labor makes it easy to set up a fake social network on any given site.

Linked data and distributed social graphs (ala FOAF + SSL) make things worse, because while before it took some amount of human effort to solve the captchas and create new accounts on a social-graph silo like Facebook, with a distributed "web of trust" approach it can all be completely automated.

That isn't to say FOAF + SSL isn't a neat replacement for the monstrosity that OpenID has become, but the "web of trust" part won't fly.

That said, in some sense it doesn't really matter. I'm certainly not arguing that we should slow in our rush towards a semantic web. The benefits are too great. But given the experience with email spammers and review fraudsters, it might be a good idea to be open about the fact that we're also introducing new hazards.

[1] So, honestly, I only have anecdotal evidence. But it doesn't seem like a very controversial assumption.

[2] "Trust" is a complicated word. It's not that knowing a review is by a friend-of-a-friend-of-a-friend isn't useful information, it's that using it to make a binary yes/no trust decision is misguided. There's been some interesting academic research in this area, Wikipedia has a rundown: http://en.wikipedia.org/wiki/Web_of_trust In what seems like a perfectly sensible approach, this paper: http://www.mindswap.org/papers/Trust.pdf suggests using social graph information as just one input into a full spam handling system.

You should follow me on twitter here.