Update: for the dataset dynamics demo developed during the Linked Data Camp Vienna there is now also a screen-cast available (video, slides in PDF):
Jürgen Umbrich and I virtually participated in the LDC09 session regarding dataset dynamics.
Over the past couple of days, we hacked a little demo on a distributed change notification system for Linked Open Data, based on voiD+dady and (a slightly modified version of) an Atom feed. Here is the overall setup:
In case you want to play around with it yourself, you can check out the source code as well. Feedback and feature requests welcome ;)
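To make the idea of Atom-based change notification a bit more tangible, here is a minimal sketch of what a consumer could do with such a feed. It is not the demo's code: it assumes a plain Atom 1.0 feed (the demo uses a slightly modified one), and the function name `changed_since`, the sample feed and the example.org URIs are made up for illustration.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from typing import List, Tuple

# Namespace of plain Atom 1.0 (the demo's modified feed may differ).
ATOM = "{http://www.w3.org/2005/Atom}"

# Hypothetical feed: one entry per changed resource in a dataset.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Changes in an example dataset</title>
  <id>http://example.org/dataset/changes</id>
  <updated>2009-12-02T09:30:00Z</updated>
  <entry>
    <title>Resource A updated</title>
    <id>http://example.org/resource/a</id>
    <updated>2009-12-02T09:30:00Z</updated>
  </entry>
  <entry>
    <title>Resource B updated</title>
    <id>http://example.org/resource/b</id>
    <updated>2009-11-20T14:00:00Z</updated>
  </entry>
</feed>"""


def changed_since(feed_xml: str, since: datetime) -> List[Tuple[str, datetime]]:
    """Return (entry id, updated time) for entries updated after `since`."""
    root = ET.fromstring(feed_xml)
    changes = []
    for entry in root.findall(ATOM + "entry"):
        entry_id = entry.findtext(ATOM + "id")
        updated_text = entry.findtext(ATOM + "updated")
        if entry_id is None or updated_text is None:
            continue  # skip entries that lack the fields we rely on
        # Older Python versions do not parse a trailing "Z", so rewrite it.
        updated = datetime.fromisoformat(updated_text.strip().replace("Z", "+00:00"))
        if updated > since:
            changes.append((entry_id.strip(), updated))
    return changes


last_check = datetime(2009, 12, 1, tzinfo=timezone.utc)
for resource, when in changed_since(SAMPLE_FEED, last_check):
    print(resource, when.isoformat())
```

A consumer would remember the time of its last check and only re-fetch the resources the feed reports as changed since then.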
I recently wrote about caching support in Linked Open Data here and got nice feedback (and questions ;) from dataset publishers. In a follow-up mail exchange, Mark Nottingham was so kind as to provide me with two very valuable resources I’d like to share with you:
Please let me know if you are aware of more resources in this area (studies, etc.) and I’ll post them here!
So, the other day I had a look at caching support in the Linked Open Data cloud and it turns out that there is a related discussion regarding caching on the ietf-http-wg@w3.org mailing list.
Then, there is another related update from Bill Roberts: Delivering Linked Data quickly with which I wholeheartedly agree.
To take the entire stuff a step further I tried to outline the overall problem in a short slide deck (best viewed full-screen ;)
My hunch is that 80% of the stuff is already out there (such as Atom, the Changeset vocabulary, voiD, etc.) and only minor pieces are missing. The next step would be to hammer out a simple demo and gather some more experience with it. If you are interested in chiming in, let me know :)
The other day I was pondering on Linked Open Data Source Dynamics and as a starting point I wanted to learn more about the caching characteristics of LOD data sources. Now, in order to establish a baseline, one should have a look at what HTTP, one of the pillars of Linked Data, offers (see also RFC2616, Caching in HTTP).
So, I hacked a little PHP script that takes 17 sample resources from the LOD cloud (from representative datasets ranging from DBpedia over GeoSpecies to W3C Wordnet) and checks which HTTP caching headers they return. The results of the LOD caching evaluation are somewhat deflating: more than half of the samples do not support cache control and less than 20% support Last-Modified or ETag headers.
I know, I know, this is just a very limited experiment. And yes, very likely there are not yet that many applications out there consuming Linked Data and hence using up the whole bandwidth. However, given that one of the arguments for the scalability of the Web is the built-in HTTP caching mechanism, LOD dataset publishers might want to consider having a closer look into what the server or platform at hand is able to offer concerning caching support.
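For readers who want to repeat such a check, here is a minimal sketch of the classification step in Python. It is not the original PHP script: the function names and the sample headers are invented for illustration, and it only looks at whether the Cache-Control, Last-Modified and ETag headers are present in a response.

```python
from typing import Dict, List, Mapping

FEATURES = ("cache_control", "last_modified", "etag")


def caching_support(headers: Mapping[str, str]) -> Dict[str, bool]:
    """Report which caching-related headers an HTTP response carries."""
    # Header names are case-insensitive in HTTP, so normalise before lookup.
    present = {name.lower() for name in headers}
    return {
        "cache_control": "cache-control" in present,
        "last_modified": "last-modified" in present,
        "etag": "etag" in present,
    }


def summarise(samples: List[Mapping[str, str]]) -> Dict[str, float]:
    """Fraction of sampled responses that support each mechanism."""
    totals = {feature: 0 for feature in FEATURES}
    for headers in samples:
        for feature, supported in caching_support(headers).items():
            totals[feature] += int(supported)
    count = len(samples) or 1  # avoid dividing by zero for an empty sample
    return {feature: hits / count for feature, hits in totals.items()}


# Made-up response headers standing in for real LOD resources.
samples = [
    {"Content-Type": "application/rdf+xml"},
    {"Content-Type": "text/html", "Cache-Control": "max-age=3600"},
    {"Content-Type": "application/rdf+xml", "ETag": '"abc123"',
     "Last-Modified": "Tue, 01 Dec 2009 10:00:00 GMT"},
    {"Content-Type": "text/html", "cache-control": "no-cache"},
]
print(summarise(samples))
```

The response headers themselves could be collected with any HTTP client, for example by issuing a HEAD or GET request per sample resource.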
Having read Adam Jacobs’ The Pathologies of Big Data and Stefano Mazzocchi’s Data Smoke and Mirrors I found myself asking: what is the motivation for people to publish linked data, and in turn to consume it (sounds funny you think? well, just because the data is available doesn’t necessarily mean it is useful or actually used ;)
Ok, so let’s start with a nice statement from Adam’s ACM article:
Here’s the big truth about big data in traditional databases: it’s easier to get the data in than out.
Yup, I think I agree and I guess the same is true for Linked Data. There are tons of ‘cheap’ ways to publish in RDF (for example, regarding relational databases, we’re currently trying to define a standard). However, there is still a need for high quality data and high quality links between the data items in order to allow the data to be used sensibly in applications!
Right, so my hunch is that for data providers there are a couple of reasons to publish their data in an open and easily accessible way, but I guess one main reason may be that by providing the raw data, one can simply cut costs. Rather than writing a Web application that serves humans and offering an additional Web service/API (such as flickr or delicious did), one can expose the original data directly via Linked Data and open up the possibility for others to develop cool applications on top of it (see also our recent work in this direction).
On the other hand, data consumers benefit from a single (RESTful) API with a uniform data model (RDF, in case it isn’t that obvious ;), which in turn enables simplified development of applications and allows the reuse of data (just like the BBC doesn’t have to maintain the artist and song data themselves anymore, but reuses MusicBrainz data).
Let me know – what is your incentive to publish/consume Linked Data?
So, you took the red pill? You’re a full blown RESTafarian brother? Good news for you, then. You’ll understand linked data in less than 30 seconds. Ok. Step by step. REST, understood as a ‘set of constraints that inform an architecture’:
… and now read the linked data principles with your ‘REST goggles’ on:
In linked data, we use HTTP URIs for everything: for documents, but also for concepts or real-world entities such as people. Linked data provides a uniform (read-only) interface through HTTP GET. The messages are self-describing through RDF and RDF-based vocabularies, and through the last of the linked data principles, what we have in the LOD cloud is a highly connected (or: interlinked) system.
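In code, the uniform interface boils down to a plain GET on an HTTP URI. The following sketch only builds such a request (it does not send it); asking for RDF via the Accept header is my assumption about how a client would typically negotiate the representation, and the helper name `linked_data_request` and the choice of media types are purely illustrative.

```python
from urllib.request import Request

# Illustrative preference: RDF/XML first, Turtle as a fallback.
RDF_ACCEPT = "application/rdf+xml, text/turtle;q=0.9"


def linked_data_request(uri: str) -> Request:
    """Build a read-only (GET) request asking for an RDF representation."""
    return Request(uri, headers={"Accept": RDF_ACCEPT}, method="GET")


# The same kind of URI identifies a real-world entity and gives access to its description.
request = linked_data_request("http://dbpedia.org/resource/Vienna")
print(request.get_method(), request.full_url)
print("Accept:", request.get_header("Accept"))
```

Passing the request to `urllib.request.urlopen` would perform the actual lookup; whatever comes back is described in RDF, no matter which dataset the URI belongs to.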
As nicely described by Leonard Richardson and Sam Ruby in RESTful Web Services you design a RESTful (ROA) system in that you:
You’ll typically end up in a 3D design space such as the following (kudos to Cesare Pautasso and Erik Wilde):

The same actually happens when you publish linked data, with some simplifications: due to the read-only characteristic of linked data you only have to worry about one HTTP verb (GET) and with RDF as the unified data model (based on your preferences and needs) you pick one of the RDF serializations (preferably RDFa, as it nicely integrates with HTML and hence allows you to serve humans and programs). When you have your data in RDF (or so ;) you’ll mainly find yourself worrying how to interlink it with other data on the Web. But this really is a huge benefit – finally enabling us to use the Web as one huge database.
As an aside: I’m aware of the fact that we still need to sort out some issues along the way, both in academia and in practice. However, I encourage people in both camps (RESTful yadayada and Linked Data rogues) to look beyond their own noses and eventually understand that there is only one Web and we all ‘live’ in it ;)
Though the following might seem obvious to some of you, I thought I’d take the time to write some lines about the data life-cycle on the Web and try to highlight some implicit assumptions and processes. We all know the old story: data itself may not be very exciting and people actually like applications rather than scrolling through endless tables or viewing a CSV file in a text editor. However, data is what ultimately drives applications, and, to a certain extent, our life.
When my family moved over from Austria recently, I experienced personally how much data is involved in our everyday life. Want to find a nice house? Searching for it requires quite some data (on both ends) such as location, price range, number of rooms, etc. Have to register? Again, data is needed (insurance numbers, birth dates, etc.). Looking for a new car? New mobile phone contract? etc.
Ok, you get the idea. We need the data. It is not an end in itself, though. I want to relocate, buy stuff, sell stuff, find a new job (ahm, not really, right – this was just an example ;) and so on. For all this I need data. It is not precisely that I’m so much interested in the data, but what I can do with the data. See above.
Enough motivation. What’s the message? Well, so far (essentially the past 15 years) we have seen people using data on the Web. In services, in documents, etc. – traditionally it would look a bit like:
However, we can do better. Two key technologies enable us to get rid of a conceptually unnecessary component (the screen scraper) and offer data directly to the applications (while still serving humans the nice CSS-styled and Ajax-powered HTML pages) – one is a concrete RDF serialization called RDFa, the other is a set of principles, called linked data. So, what is possible with the above mentioned is something like:
This is essentially a paradigm shift from consumer-pull (that is, using layout information in HTML to guess the semantics) to publisher-push (that is, the one who publishes the data along with the document explicitly declares what the data is and what its semantics are). All you need is a globally universal and uniform way to refer to entities (such as houses, cars, mobile phones, etc.), which turns out to be URIs, a way to move the data around (you’ve guessed it, it’s HTTP) and a common data model to structure your data (correct, we’re talking about RDF). How does this fit together? Well, the latter three technologies are the core of linked data, and RDFa is the way to deliver RDF in HTML. Sounds easy? It is ;)
Ok, enough theory. Now, two things to remember: first, this is not a vision or a dream. It’s reality. You can use it NOW. In your Web site, in your Web application. Second: it’s cheap. Just change your templates, which generate the HTML from the RDBMS, or use a CMS which has built-in support for it (for example, in Drupal you can already use it with some tiny configuration changes). And you can test and view the results: for example, using Google’s rich snippets test tool or, say, in a generic Web of Data browser.
Ok, so finally the IEEE Internet Computing article on Exploiting Linked Data to Build Web Applications is available. Though this is a nice first step, much more is needed to advance the field. The goal is to enable people to actually use linked data in their Web applications, rather than ‘only’ publish datasets. Don’t misunderstand me here: it is a good thing to publish on the Web of Data, but ultimately data is meant to be used somewhere, right? Publishing linked data is not an end in itself.
To support this effort, I’m currently compiling a technical report in and for DERI’s Linked Data Research Centre (LiDRC) that looks at current examples of linked data-driven Web applications, gathers good practices and discusses the anatomy of a typical application (in the last part of the report issues and challenges are discussed, as well). So, one of the central contributions is a proposed concept for linked data-driven Web applications, which renders as follows:
The proposed components read as follows:
I’d be happy to hear from you what you think about this proposal. Any architectural feedback is welcome!
So Paul asked recently: Does Linked Data need RDF? If you drink a certain sort of coffee, I guess you are familiar with my answer: What else? ;)
Seriously. Let’s step back for a second and try to work through to the core of the issue from a totally different angle.
Compare a set of predefined, fixed terms for certain domains, easy to use, etc. with a flexible and generic (hence, maybe, a bit more initial effort required) approach for annotating data, that is structured data on the Web. Sounds familiar? You’re right. I assume that you are aware of the old discussion around microformats vs RDFa, right? So, there we go …
Now, if one looks closer into the HTML 4 spec, one finds a bunch of link types, such as next, help, section, etc.; I’m gonna pick two, IMO, important sentences from there:
User agents, search engines, etc. may interpret these link types in a variety of ways. For example, user agents may provide access to linked documents through a navigation bar.
Ah, so the targeted consumer of the link is indeed a machine, not a human in the first place. Further:
Authors may wish to define additional link types not described in this specification. If they do so, they should use a profile to cite the conventions used to define the link types.
Ok, so there is a sort of extensibility mechanism defined in the HTML 4 spec as well. Very well! Or?
An analogy might help now to understand the point I’m trying to drive home, here. If you think back to microformats vs. RDFa, the same can be said about HTML 4 link types vs. RDF(a) …
HTML 4 link types as of section 6.12 of the spec are essentially the poor man’s semantic links, directly available in HTML. They are targeting machines (not human users in the first place), but are predefined in a sense and quite limited.
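To see what a machine can get out of these link types, here is a small sketch that pulls the `rel` values out of an HTML document and reports them as (source, link type, target) tuples. The class and function names, the sample markup and the example.org URI are invented for illustration; only Python's standard `html.parser` module is used.

```python
from html.parser import HTMLParser
from typing import List, Tuple

TypedLink = Tuple[str, str, str]  # (source document, link type, target)


class LinkTypeParser(HTMLParser):
    """Collect typed links from the rel attribute of <a> and <link>."""

    def __init__(self, base: str) -> None:
        super().__init__()
        self.base = base
        self.links: List[TypedLink] = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        attributes = dict(attrs)
        rel, href = attributes.get("rel"), attributes.get("href")
        if not rel or not href:
            return  # an untyped link carries no machine-readable semantics
        # rel is a space-separated list; HTML 4 link types are case-insensitive.
        for link_type in rel.lower().split():
            self.links.append((self.base, link_type, href))


def typed_links(html: str, base: str) -> List[TypedLink]:
    parser = LinkTypeParser(base)
    parser.feed(html)
    parser.close()
    return parser.links


SAMPLE_HTML = """
<link rel="next" href="chapter2.html">
<a rel="Help Section" href="help.html">Help</a>
<a href="plain.html">An untyped link</a>
"""
for link in typed_links(SAMPLE_HTML, "http://example.org/chapter1.html"):
    print(link)
```

Each tuple already has the shape of a typed link between two documents; the limitation is that the middle element is a bare keyword whose meaning is fixed by the spec (or a profile) rather than something that can be looked up on the Web.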
If you agree up to here by and large, then the question is really: what is the alternative? What technology out there, deployed, with community support, a set of tools available, etc. is available to represent, in a generic way (needed to write a generic parser), any sort of typed link between two entities on the Web?
RDF.
What else? ;)
Note: credits go out to Juergen Umbrich with whom I discussed that issue yesterday evening and who inspired me to write the post …