Linked Data, Confidence Games and the Transitivity of Trust
Over the Christmas holidays I took my family on a five thousand mile roadtrip around the American West. It took a couple of weeks and I expected to spend a lot of time on my favorite user-generated travel review site.
And I did spend a lot of time on the site, enough to eventually figure out that it had been comprehensively infiltrated by review spammers. Some of the spam reviews were obvious: "I loved this place! Five stars!" when all the rest of the reviews were negative. Some were more devious: "There were bedbugs! They spat in my soup! Zero stars!" when all the other reviews were stellar. In other cases it was much harder to tell, and in all cases the average rating was highly suspect.
Turns out there are companies that specialize in vandalizing review sites[1]. The companies employ actual humans who spend actual creative effort to craft misleading reviews. They even set up realistic user profiles, and on some sites they add each other as friends. In other words, it's considered worthwhile to spend real time and effort on this stuff.
It's been suggested that there's a technological solution: If the reviewers are part of a social network, it's possible to extract some useful statistics that might help determine if the user is real or fake.
If the reviewer is a friend, that's obviously useful information. But there's very little chance that some random reviewer is your friend.
But what if the reviewer is part of your extended social network? Surely the fact that somebody is a friend of a friend is some indication that they're trustworthy, or at least that they're a real person.
Nope.
First off, with a fan-out of 200 friends the 2nd-level extended social graph is around 40,000 people. Allowing for annoying people who friend everybody, an extended social graph could easily include a substantial portion of the entire population of the planet. All it takes is a couple of mistaken friend-adds to get you hooked up to a spammer-created sub-network. Even if you're careful, it's overwhelmingly likely that some friend-of-a-friend isn't.
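To put a rough number on it, here's a back-of-the-envelope sketch (the fan-out of 200 is the same assumption as above; the fraction of fake accounts is purely an illustrative guess):

    # Back-of-the-envelope: how big is the friend-of-a-friend circle,
    # and how likely is it to contain at least one fake account?
    fan_out = 200                 # assumed average number of friends
    second_level = fan_out ** 2   # ~40,000 friends-of-friends, ignoring overlap

    p_fake = 0.001                # illustrative guess: 1 in 1,000 accounts is fake
    p_all_clean = (1 - p_fake) ** second_level
    print(f"friends-of-friends: ~{second_level:,}")
    print(f"chance at least one is fake: {1 - p_all_clean:.1%}")  # effectively 100%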
So, trust is clearly not transitive and the idea of a "web of trust" cannot be taken literally[2].
In most cases, it's only possible to determine if the "shape" of the reviewer's social graph is reasonable. That is, are they friends with other plausible-looking people? Are many of their friends known fake profiles? Do they have a realistic number of friends? Etc.
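Concretely, the kind of check I have in mind is no deeper than this sketch; every threshold and the set of known fakes are invented for illustration, not taken from any real site:

    def plausibility_score(profile, known_fakes):
        """Crude heuristics on the "shape" of a reviewer's social graph.

        profile["friends"] is assumed to be a set of account IDs; all of
        the numbers below are arbitrary illustrative thresholds.
        """
        friends = profile["friends"]
        score = 0.0
        if 20 <= len(friends) <= 1000:                 # a realistic number of friends
            score += 1.0
        fake_ratio = len(friends & known_fakes) / max(len(friends), 1)
        score -= 5.0 * fake_ratio                      # friends who are known fake profiles
        if profile.get("account_age_days", 0) > 180:   # older accounts look a bit more plausible
            score += 0.5
        return score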
But that's trivial to game. Even if there are obstacles to a totally automated approach, the application of ultra-cheap human labor makes it easy to set up a fake social network on any given site.
Linked data and distributed social graphs (a la FOAF + SSL) make things worse: before, it took some amount of human effort to solve the captchas and create new accounts on a social-graph silo like Facebook, but with a distributed "web of trust" approach it can all be completely automated.
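To see how low the automation barrier is, here's a sketch that fabricates a FOAF profile with the rdflib library (assuming rdflib 6+); the WebIDs are made up, and a loop around this could mint thousands of interlinked "people" per minute:

    from rdflib import Graph, Literal, URIRef
    from rdflib.namespace import FOAF, RDF

    def fake_profile(name, friends):
        """Generate a FOAF document for an invented person (illustration only)."""
        g = Graph()
        me = URIRef(f"https://example.org/people/{name}#me")   # hypothetical WebID
        g.add((me, RDF.type, FOAF.Person))
        g.add((me, FOAF.name, Literal(name)))
        for friend in friends:
            g.add((me, FOAF.knows, URIRef(f"https://example.org/people/{friend}#me")))
        return g.serialize(format="turtle")   # returns a str in rdflib 6+

    print(fake_profile("alice", ["bob", "carol"]))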
That isn't to say FOAF + SSL isn't a neat replacement for the monstrosity that OpenID has become, but the "web of trust" part won't fly.
That said, in some sense it doesn't really matter. I'm certainly not arguing that we should slow our rush towards a semantic web. The benefits are too great. But given the experience with email spammers and review fraudsters, it might be a good idea to be open about the fact that we're also introducing new hazards.
[1] So, honestly, I only have anecdotal evidence. But it doesn't seem like a very controversial assumption.
[2] "Trust" is a complicated word. It's not that knowing a review is by a friend-of-a-friend-of-a-friend isn't useful information, it's that using it to make a binary yes/no trust decision is misguided. There's been some interesting academic research in this area, Wikipedia has a rundown: http://en.wikipedia.org/wiki/Web_of_trust In what seems like a perfectly sensible approach, this paper: http://www.mindswap.org/papers/Trust.pdf suggests using social graph information as just one input into a full spam handling system.
7 Comments:
Thanks for this post, Christopher!
I think you've identified a fundamental problem with the FOAF+SSL model: the "policy" side has not been sufficiently explored.
The core idea is sound; it is a form of attribute-based --- sometimes called semantic --- access control. As with each new model that hopes to improve some "*ility" of access control, it has promise. The core problem (IMHO) is that the FOAF space is too dilute, especially given the simplistic policies that have been examined to date.
There are at least two ways to improve the situation: tougher policies (on the service side) and less dilute relationship spaces.
There is much to do in terms of defining the shape of the relationship webs we should trust. Policy webs coupled with proprietary relationship vocabularies are much better tools than the "who-knows-whom" policies based on FOAF that have been tested so far...
You've stated in this post that "Trust is Complicated", so why do you assume that a single relation (i.e. foaf:knows) is what determines a "Web of Trust" (WOT)?
What makes the foaf:knows predicate transitive? I don't see that as being true re. the FOAF ontology.
Why do you assume that the degrees of separation are a constant in any WOT?
Why do you assume that privileges last forever in any WOT? Is this the case in real life?
I envisage a Web of Trust to be comprised of:
1. Personal Identifiers (WebIDs, i.e., Person Entity URIs that enable FOAF+SSL),
2. Authentication (FOAF+SSL's sole purpose via binding of X.509 cert, Private Key, and WebID),
3. Data Access Policies (sophisticated rules that leverage FOAF+SSL).
All of the above, leveraging Linked Data graphs, are the keys to the solution.
Like the things that ultimately work on the Web, this is about a "Deceptively Simple" rather than "Simply Simple" solution to distributed/federated identity.
Kingsley
There seems to be a misconception that FOAF+SSL is limited to an informal Web-of-Trust model.
This WoT model is indeed possible with FOAF+SSL, and it was the motivation for a "grass-roots" approach, because it enables individuals to form a network of trust without the burden of a large administrative structure.
As these networks of peers trusting each other grow (or are required to grow), the rules required to trust various pieces of information need to be formalised; this is why formal organisations have those administrative structures. Then, this is no longer as much a technical problem as it is about administration, organisation policies and ultimately liability.
FOAF+SSL does not prevent you from establishing formal rules of trust, on the contrary. What it allows individuals to have is their own identifier (and a way to verify that identity), while decoupling the management of information about this identifier (via semantic web discovery).
Like in any system, or in real life, service providers have to use information they gather from sources they trust. This source could be an organisation's head office, a small group of friends or a government. Some of them can be very hierarchical, others will be more informal; this will certainly depend on what the service is and what pieces of information are required for authorisation (and audit).
So I think both Kingsley and Bruno are correct to decouple authentication from authorization concerning FOAF+SSL.
I think the problem is we currently don't have enough examples --- meaning real, shareable, repeatable implementations in code --- of "interesting" (e.g.) data access policies based on the FOAF+SSL model.
The shareable examples we do have are very simple (who knows whom) and lead to the concerns expressed and the apparent "confusion"; the more interesting examples --- I could be mistaken here --- are proprietary (based on tools embedded in particular platforms) and thus not repeatable.
Please feel free to enumerate here the many ways I'm wrong! ;)
Kingsley: My assumptions about the authorization model were based on the descriptions at foaf+ssl. The words "web of trust" mean that some sort of transitivity is involved, and the examples imply the scenario I've described. The infrastructure is obviously capable of supporting many different authorization models, including a simple ACL. I actually like ACLs, they're easy to understand. Your item #3 "Data Access Policies" is a problem: it assumes that it's possible to come up with a reasonable web-of-trust based policy. What if I claim it isn't possible outside very restricted contexts? Give me some specifics to pick at.
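For concreteness, the kind of fixed ACL I have in mind is nothing fancier than this sketch, with made-up WebIDs:

    # A fixed access control list: resource -> the WebIDs allowed to read it.
    # The URIs are made up; the point is that the policy is trivial to inspect.
    ACL = {
        "https://example.org/photos/vacation/": {
            "https://example.org/people/bob#me",
            "https://example.org/people/carol#me",
        },
    }

    def may_read(resource, webid):
        return webid in ACL.get(resource, set())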
olyerikson: I see what you mean about richer policy models, but that path quickly runs into problems familiar to enterprise security architects. Non-simplistic models are hard to predict, but it must be easy to predict the consequences of a security policy, otherwise human beings can't effectively judge whether the policy is doing what they want. Facebook has struggled with this very publicly, and Google is running into it with Buzz. Without writing another blog post here, I'd suggest that the problem of a trusted friend suddenly friend'ing somebody who you consider unreliable might not be solvable no matter how sophisticated the security policy is. Maybe "here's a fixed access control list" is about as complex as most people really want to deal with.
Data Access Policies, like "Beauty", have to be context specific; if not, then what use do they serve?
"Web" is not World Wide Web to me, by default; It means: a Data Network Mesh which may be private (Intranet), semi-private (extranet), or public (World Wide Web).
Security is a multi-dimensional problem compounded by social vectors. We have to be able to make "Clubs" and "Groups" on any data network based Web of Trust.
I want to be able to say: this picture is only available to my "relatives" and I control the definition of "relatives" which may feed off some shared ontology for this discourse realm.
Bottom line, private keys, linked data based data meshes, and context specific policies are the keys to the Webs of Trust that I envisage.
I will certainly be putting out my own examples of FOAF+SSL driven ACLs, as per my usual approach to demonstrating whatever my positions are re. Linked Data etc.
If you have a Personal URI (WebID), we can also perform some tests etc.
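To make the "relatives" example concrete, a minimal sketch; the rel:relative predicate and every URI are invented for illustration, and the requester's WebID is assumed to come out of the FOAF+SSL handshake:

    from rdflib import Graph, Namespace, URIRef

    REL = Namespace("https://example.org/vocab/rel#")   # hypothetical relationship vocabulary
    OWNER = URIRef("https://example.org/people/kingsley#me")

    # The owner's own linked data defines who counts as a "relative".
    owner_data = """
    @prefix rel: <https://example.org/vocab/rel#> .
    <https://example.org/people/kingsley#me> rel:relative <https://example.org/people/cousin#me> .
    """
    g = Graph()
    g.parse(data=owner_data, format="turtle")

    def may_see_picture(requester_webid):
        return (OWNER, REL.relative, URIRef(requester_webid)) in g

    print(may_see_picture("https://example.org/people/cousin#me"))    # True
    print(may_see_picture("https://example.org/people/stranger#me"))  # False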
Christopher, I agree that by introducing policy "richness" we usually also introduce management complexity. This has been my personal experience designing and building novel attribute-based access control systems, which gave administrators the desired ability to create finely-tuned policies over arbitrary {Subject|Object|Action} policy sub-domains.
I mention attribute-based policy regimes here because they are related to so-called semantic access control. The sub-domains for a given policy --- the Subject, Object, and Action to which it applies --- are defined by logical expressions over the objects' attributes.
I argued for this approach because it was well-suited to graph oriented DB models, which is what the target application used. To add a particular kind of control, you first created the policy, and then added the attributes to objects in the DB --- the new relationships --- that caused the policy to apply to the objects.
ACLs are modeled in such systems by (for example) adding an attribute to designate membership in the policy subject sub-domain to which a particular policy ("allow access to John's vacation images") applies. On the surface this sounds different than the FOAF membership-based approach, but it is still an example of the reverse query problem that you have in such systems: is the Subject in question a member of the set to which this policy applies? (repeat for all policies to find the set of policies that must be tested...)
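As a rough sketch of that reverse query (the attributes and policies are invented for illustration):

    # Each policy names its subject sub-domain as a predicate over attributes.
    POLICIES = [
        ("allow access to John's vacation images",
         lambda subject: "johns-vacation-viewers" in subject["groups"]),
        ("allow read on the project wiki",
         lambda subject: subject.get("department") == "engineering"),
    ]

    def applicable_policies(subject_attrs):
        """The reverse query: which policies apply to this subject?"""
        return [name for name, test in POLICIES if test(subject_attrs)]

    print(applicable_policies({"groups": {"johns-vacation-viewers"}, "department": "sales"}))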