
A case for Central Points of Access (CPoA) in decentralised systems

This post has been triggered by a Twitter thread, where I replied to @olyerickson that I think https://subj3ct.com is a good thing to have. Then, @hvdsomp noted (rightly!) that registries don’t scale (in reference to a conversation we had earlier on).

Big confusion, right? Michael says one thing and then the opposite on the very next day. Nah, not really 😉

Actually, it turns out I’ve been quite consistent over time. In late 2008 I wrote in Talis’ NodMag #4 (on page 16):

Could you imagine reporting your new blog post, Wiki page or whatever you have to hand to an authority that takes care of adding it to a ‘central look-up repository’? I can’t, and there is at least one good reason for it: such things don’t scale. However, there are ways to announce and promote the content.

So, what is the difference between a UDDI-style registry (which, btw, didn’t exactly turn out to be a success) and what I’ll call a central point of access (CPoA) in the following?

Before I try to answer the question, let me first give you some examples of CPoAs in the Web of Data context:

- Sindice
- sameas.org
- prefix.cc
- Falcons
- Cupboard
- voiD stores
- Uberblic

Some of these CPoAs employ automated techniques to fill their internal database (such as Sindice or sameas.org), some depend on human input (for example prefix.cc). Some focus on a special kind of use case or domain (Cupboard or voiD stores), some try to be as generic as possible (Falcons, Sindice).

All of them, though, do share one principle: it’s up to you whether you’re listed there or not (ok, technically, some might discover your data and index it, but that’s another story). The (subtle) difference is a priori vs. a posteriori: no one forces you to submit, say, your voiD file to a voiD store or to Sindice. However, if you want to increase your visibility, if you want people to find your valuable data and use it, you’ll need to promote it. So, I conclude: one effective way to promote your data (and schema, FWIW) is to ‘feed’ a CPoA. Contrast this with a centralised registry where you need to submit your stuff first, otherwise no one is able to find it (or, put another way: if you don’t register, you’re not allowed to participate).
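
To make this concrete, here is a minimal sketch of what ‘feeding’ a CPoA could look like; the ping endpoint and the ‘url’ parameter are assumptions for illustration, not any particular service’s documented API:

    # Minimal sketch: announce a voiD description to a CPoA by pinging it.
    # The endpoint URL and "url" parameter are illustrative assumptions --
    # consult the actual service's documentation for its submit/ping API.
    import urllib.parse
    import urllib.request

    VOID_FILE = "http://example.org/data/void.ttl"      # your dataset description
    PING_ENDPOINT = "http://api.example-cpoa.org/ping"  # hypothetical endpoint

    def ping_cpoa(url):
        """Submit a URL to the CPoA and return the HTTP status code."""
        data = urllib.parse.urlencode({"url": url}).encode("utf-8")
        with urllib.request.urlopen(PING_ENDPOINT, data=data) as response:
            return response.status

    print("CPoA responded with:", ping_cpoa(VOID_FILE))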

There are exceptions I’m aware of: DNS, for example, which works, I think, mainly due to its hierarchical design. Other approaches can be pursued as well; P2P systems come to mind, for example.

Nevertheless, I stand by it: centralised, forced-to-sign-up registries are bad for the Web (of Data). They do not scale. CPoAs, such as those listed above, are not only good for the Web (of Data) but essential to making it usable; especially for bridging the term-URI gap (or: entering the URI space), which I’ll flesh out in another post. Stay tuned!


Discussion

4 thoughts on “A case for Central Points of Access (CPoA) in decentralised systems”

  1. Hi Michael,

    I think that Subj3ct probably fits under your definition of a CPoA. One thing to note about Subj3ct is that anyone can register a feed – you don’t need to be a registered user to do that. Our simple trust metric is partially based on whether or not the feed was registered by a logged-in Subj3ct user and whether or not the feed is under a namespace registered by a Subj3ct user, but neither of these things is compulsory.

    In terms of scaling we have similar concerns. In its current implementation I can see Subj3ct having both technical and social issues in scaling. What we are hoping to do is to encourage/sponsor/do an open-source implementation of the Subj3ct server with some added server-to-server trust and/or publishing protocols to enable any person or organisation to run their own Subj3ct with some form of either delegation through to trusted servers or replication from trusted servers. My feeling is that something like this is the best strategy for balancing the needs of Linked Data producers and consumers.

    Cheers

    Kal
    (One of the coders of Subj3ct)

    Posted by Kal Ahmed | 2010-02-18, 12:56
  2. Michael,

    I guess the LOD Cloud Cache instance at http://lod.openlinksw.com doesn’t count as a lookup service? How is it different from sameas.org, Sindice, Uberblic, Falcons, and the others you’ve listed above?

    The whole idea behind the EAV/CR model that drives the burgeoning Web of Linked Data is this:

    We have a Web-scale distributed DBMS where every record is a 3-tuple (TBox and ABox). Inference rules enable users of this system (depending on the engine behind a given data space) to perform context-specific data reconciliation. BTW – of the tools you list above, I don’t know any that offer context-specific data reconciliation via backward-chained reasoning — as we do and demonstrate via our service.

    Remember, the DBMS realm is close to 50 years old. Linked Data is still fundamentally a realm within the broader DBMS realm (and I don’t mean RDBMS).

    All DBMS solutions are dependent on their ability to provide lookups and record joins across a variety of data-containment boundaries (which could be a Named Graph in RDF Store land). In addition, DBMS technology has understood data replication and synchronization for a long time too.

    Let’s try to spend more time reconciling Linked Data to what exists rather than giving the impression that everything in this realm is new and novel, etc.

    Linked Data DNS is something I spoke about at the inaugural Linked Data session in Banff in 2007. There is even an RDFSync protocol (from Giovanni and Orri) gathering dust, conceived with a DNS-replication equivalent for RDF in mind.

    Kingsley

    Posted by Kingsley Idehen | 2010-02-18, 18:13
  3. Side points:

    re: “exceptions.. DNS, for example, which works”
    The Web of Linked Data is already built on this system, which is why linked data is possible, and indeed works 🙂 (there may be scope for “something” in TXT DNS records, unsure what though)
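
    One purely speculative shape for that “something”: a TXT record advertising where a domain’s dataset description lives (the void= key below is invented for illustration, not an existing convention):

        # Speculative sketch: discover a dataset description via DNS TXT records.
        # The "void=" key is an invented convention, not an existing standard.
        # Requires: pip install dnspython
        import dns.resolver

        def discover_void(domain):
            """Return any voiD file URLs advertised in the domain's TXT records."""
            urls = []
            for record in dns.resolver.resolve(domain, "TXT"):
                for chunk in record.strings:  # TXT data arrives as byte chunks
                    text = chunk.decode("utf-8")
                    if text.startswith("void="):
                        urls.append(text[len("void="):])
            return urls

        print(discover_void("example.org"))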

    re: “Could you imagine reporting your new blog post, Wiki page or whatever you have to hand to an authority that takes care of adding it to a ‘central look-up repository’?”
    AFAIK we do: you send out a “ping” (through 3rd-party services) to Google, which takes care of adding it to the ‘central look-up repository’ – which is Google itself in most cases.

    It may also be worth mentioning semantic ping in the above post?
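
    For reference, the classic blog ping is a single XML-RPC call (Ping-O-Matic’s well-known relay endpoint shown below); a semantic-ping service would expose a similar notify-by-URL interface:

        # The classic blog "ping": one XML-RPC call announcing new content,
        # which Ping-O-Matic relays to a number of central look-up services.
        import xmlrpc.client

        server = xmlrpc.client.ServerProxy("http://rpc.pingomatic.com/")
        result = server.weblogUpdates.ping("My Data Blog", "http://example.org/blog")
        print(result)  # expected shape: {'flerror': False, 'message': '...'}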

    Main thoughts:
    At the minute it appears to me that much of the “web of data” is growing in a hierarchical or tree fashion – naturally, the current tech stack promotes one-way or “back-links” from a resource to the thing it is the sameAs, but rarely the other way (without manual intervention using Silk or lookups on sameas.org etc.) – thus I’d say we are headed for the problems of the current web all over again: becoming reliant on services (like the ones you mentioned) to look up data, rather than it all being interlinked and creating a proper “web” of data.

    In the short term we can exploit these services to make things a bit more automated; for instance consider the following triple:

    <http://example.org/resource> rdfs:seeAlso <http://sameas.org/?uri=http://example.org/resource> .

    Whilst this doesn’t address the source of the problem directly, it does leverage current services to make things a bit easier for the machine.
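
    To push that a step further, here is a minimal sketch that materialises such links automatically by querying sameas.org; the JSON URL pattern and response shape are assumptions about the service’s lookup interface, so verify them before relying on this:

        # Sketch: materialise rdfs:seeAlso links from sameas.org co-references.
        # The /json?uri= URL pattern and the response shape (a list of bundles
        # with a "duplicates" array) are assumptions -- check the live service.
        import json
        import urllib.parse
        import urllib.request

        def seealso_triples(uri):
            """Return N-Triples linking `uri` to its sameas.org co-references."""
            lookup = "http://sameas.org/json?uri=" + urllib.parse.quote(uri, safe="")
            with urllib.request.urlopen(lookup) as response:
                bundles = json.load(response)
            triples = []
            for bundle in bundles:
                for dup in bundle.get("duplicates", []):
                    if dup != uri:
                        triples.append(
                            f"<{uri}> <http://www.w3.org/2000/01/rdf-schema#seeAlso> <{dup}> ."
                        )
            return triples

        for triple in seealso_triples("http://dbpedia.org/resource/London"):
            print(triple)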

    Moving forward I believe that lessons can be learned from the blogging community; one much-overlooked feature is the “trackback”, where one data source notifies another that it has published data which references or duplicates the original data. Such a system in the linked data world could work wonders; for instance, let’s say some system was implemented where I could inform a graph that I have a sameAs resource in my own graph. Purely by implementing this functionality we would address the resource hierarchy and one-way linking that is creeping in at the minute.

    On the same note, for some time now I’ve had the thought that graphs should/could implement RESTful functionality; why do we only HTTP GET a graph/resource? GET has been much overused for years, which is why we have “web browsers” and not “web clients” or full user-agents – why, for instance, can’t I issue an HTTP PUT to a graph URI, or a DELETE, or in this case a POST?

    Let’s just imagine for a minute that all of the aforementioned were implemented; I could simply HTTP POST some serialised triples to http://dbpedia.org/resource/London and they could be added to the graph containing that subject.
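
    As a sketch of what that client side could look like (purely hypothetical – DBpedia does not actually accept writes like this):

        # Hypothetical sketch: writing to the Web of Data with plain HTTP verbs.
        # DBpedia does not accept POSTs like this; it only illustrates the idea.
        import urllib.request

        triples = (
            "<http://example.org/me/London> "
            "<http://www.w3.org/2002/07/owl#sameAs> "
            "<http://dbpedia.org/resource/London> .\n"
        ).encode("utf-8")

        request = urllib.request.Request(
            "http://dbpedia.org/resource/London",
            data=triples,
            headers={"Content-Type": "text/plain"},  # N-Triples' media type of the era
            method="POST",
        )
        with urllib.request.urlopen(request) as response:
            print(response.status, response.reason)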

    To really address the above, though, provenance and reification would need to be addressed properly; swp:assertedBy and swp:quotedBy would need to work on a per-triple level rather than per-graph.

    OR perhaps both ideas I’ve mentioned could be combined to create something seamless, using the techs as currently advised.. as follows..

    Let’s say that Authorities, Authorization and Warrants [1] were implemented over at DBpedia, and that the default graph containing <http://dbpedia.org/resource/London> was swp:assertedBy a trusted source – obviously, inserting a new untrusted/quoted triple into this graph would break the provenance and assertion – however, we could also have an untrusted or swp:quotedBy graph to handle all this data, and rdfs:seeAlso links per resource to the untrusted graph(s).

    Implementation-wise (and very briefly) this could work as follows:

    Client HTTP POSTs some serialized triples to http://dbpedia.org/resource/London (for instance a sameAs triple)

    The web server redirects the request to the untrusted graph’s URI, which then accepts the triples (perhaps doing some validation on them in the background), adds them to the untrusted graph, and returns the relevant status code to the client.
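
    Server-side, that flow might look roughly like the following minimal sketch (endpoint paths and the in-memory ‘graph’ are invented for illustration):

        # Minimal sketch of the redirect-to-untrusted-graph flow described above.
        # Endpoint paths and the in-memory "graph" are invented for illustration.
        # Requires: pip install flask
        from flask import Flask, redirect, request

        app = Flask(__name__)
        untrusted_graph = []  # stand-in for a quarantined named graph

        @app.route("/resource/<name>", methods=["POST"])
        def resource(name):
            # Never touch the asserted (trusted) graph directly; send the
            # client on to the untrusted graph's URI instead.
            return redirect(f"/untrusted/{name}", code=307)

        @app.route("/untrusted/<name>", methods=["POST"])
        def untrusted(name):
            payload = request.get_data(as_text=True)
            # (A real server would parse and validate the triples here.)
            untrusted_graph.append((name, payload))
            return "", 201  # Created

        if __name__ == "__main__":
            app.run(port=8080)

    (A 307 rather than 303 redirect is used so the client re-sends the POST body to the new location.)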

    OR.. perhaps all owl:sameAs links from untrusted sources could be swapped to rdfs:seeAlso; this way you are not asserting that the two things are equivalent, you’re simply asserting that more information may be available; which wouldn’t mean you were specifying incorrect information, and thus provenance for the graph would remain intact.
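
    That downgrade is easy to mechanise; a small sketch with rdflib (the trust decision is stubbed out and purely illustrative):

        # Sketch: demote owl:sameAs claims from untrusted sources to
        # rdfs:seeAlso, so the graph never asserts equivalence it can't
        # stand behind. Requires: pip install rdflib
        from rdflib import Graph
        from rdflib.namespace import OWL, RDFS

        def is_trusted(source_uri):
            """Stub: decide trust however you like (whitelists, signatures, ...)."""
            return source_uri.startswith("http://trusted.example.org/")

        def demote_untrusted_sameas(graph, source_uri):
            """Swap owl:sameAs for rdfs:seeAlso when the source is untrusted."""
            if is_trusted(source_uri):
                return graph
            for s, p, o in list(graph.triples((None, OWL.sameAs, None))):
                graph.remove((s, p, o))
                graph.add((s, RDFS.seeAlso, o))
            return graph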

    OR.. introduction of something like graphServiceURL / graphServiceType together with a new / existing RDF update protocol

    Just brainstorming, but there certainly seems to be scope both to utilize these services more via automated machine understanding, and to make the web of data more of a web than a hierarchy/tree.

    Posted by Nathan | 2010-02-18, 18:16
  4. Kingsley,

    of course, the LOD Cloud Cache instance at http://lod.openlinksw.com is a CPoA. A very good example, indeed. Though I didn’t really intend to give a comprehensive list, nor did I promise one anywhere in the post.

    Or, alternatively: http://lod.openlinksw.com is so well known in the community that it didn’t come to my mind to explicitly list it. Anyway, will update the post – thanks for the ‘reminder’ 😉

    Cheers,
    Michael

    Posted by woddiscovery | 2010-02-19, 10:26
