
On the Effectiveness and Efficiency of Discovery

Imagine you’re running a research group of 100 people. You want to find out the expertise of your chaps and aggregate their profiles. Sure, you can perfectly well sit down and browse through the tons of material you have about your people: their homepages, project pages, subversion commits, blog posts, tweets, logs from IRC, you name it. Then, for each person, you collect all the data found on the Web (or in internal information sources) and dump it into a data bank of your choice (hum? you’re using MS Excel, never mind 😉).

This process might certainly be effective. You’ve gathered detailed information about 100 people and know precisely what they do and what they’re good at. Additionally, you’ve spent (or wasted?) the equivalent of 5000€, as it took you, say, a week? And I’m only talking about gathering the data here, not the tedious task of aggregating it nicely and formatting it properly so that you can use it to impress your sponsor.

Now, let’s imagine the same situation, but rather than going out and collecting the data yourself, you ask people to provide their profiles themselves. All you do is set up a standardised form which contains fields for bio data, publications, projects, etc., and the people themselves provide this data by filling in the relevant fields. Then, after the deadline, you just press the ‘dump now’ button and voilà, there you go …

Why am I telling this story? I guess this is mainly motivated by the fact that I am often faced with the question: why should one care about (using) voiD? It is true that, with follow-your-nose (FYN), RDF offers a way to discover everything you like, provided you’re not limited by time and/or budget. So, we note that this method is effective but NOT efficient.

To put it in other words, to a certain extent, FYN allows you to discover, gather and integrate all RDF-based data out there. It’s effective, but not very efficient. That is where voiD comes into play: people who have the data (or, at least, know it very well 🙂) provide a sort of summary of the dataset (regarding topics covered, license, vocabularies used, statistics on triples, interlinking, etc., as explained in the voiD guide). Then, all you need to do is operate on this summary. Using voiD, hence, for the task of discovery regarding the gathering, aggregation, and integration of data is effective and efficient, IMHO.
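To make the contrast a bit more tangible, here is a minimal sketch in Python. Everything in it is made up for illustration: the toy resource URIs, the two datasets, and the dictionary keys (`topic`, `triples`, `vocabularies`, `links_to`), which are plain stand-ins and not the actual voiD terms. It simply counts how many fetches each approach needs to find out which datasets exist.

```python
from collections import deque

# Hypothetical toy "web of data": each resource maps to the resources it links to.
# The part before the slash names the dataset a resource belongs to.
LINKS = {
    "people/alice": ["people/bob", "pubs/p1"],
    "people/bob": ["people/carol", "pubs/p2"],
    "people/carol": ["pubs/p1"],
    "pubs/p1": ["pubs/p2"],
    "pubs/p2": [],
}

# Hypothetical dataset-level summaries, one per dataset, written by the publisher.
# The keys are illustrative stand-ins, not real voiD property names.
SUMMARIES = {
    "people": {"topic": "researchers", "triples": 3000,
               "vocabularies": ["foaf"], "links_to": ["pubs"]},
    "pubs": {"topic": "publications", "triples": 12000,
             "vocabularies": ["dcterms"], "links_to": []},
}


def follow_your_nose(seed):
    """Crawl resource by resource; return (datasets found, number of fetches)."""
    seen = {seed}
    queue = deque([seed])
    datasets = set()
    fetches = 0
    while queue:
        uri = queue.popleft()
        fetches += 1  # one dereference per resource
        datasets.add(uri.split("/")[0])
        for target in LINKS[uri]:
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return datasets, fetches


def use_summaries(seed_dataset):
    """Walk dataset-level links only; return (datasets found, number of fetches)."""
    seen = {seed_dataset}
    queue = deque([seed_dataset])
    fetches = 0
    while queue:
        dataset = queue.popleft()
        fetches += 1  # one fetch per summary, regardless of dataset size
        for target in SUMMARIES[dataset]["links_to"]:
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen, fetches


fyn_datasets, fyn_fetches = follow_your_nose("people/alice")
void_datasets, void_fetches = use_summaries("people")
print(fyn_datasets == void_datasets, fyn_fetches, void_fetches)
```

Both functions end up with the same set of datasets, so both are effective. The difference is that the crawl's cost grows with the number of resources, whereas the summary-based walk grows only with the number of datasets.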


About woddiscovery

Web of Data researcher and practitioner


3 thoughts on “On the Effectiveness and Efficiency of Discovery”

  1. Got a comment from http://twitter.com/mattroweshow via a direct message re ‘Why do you believe RDF to not be efficient?’

    So, Matthew, thanks for the question (and for spotting the typo as well 😉).

    Short answer: scalability. Remember the poor guy that has to query 100 people. Next day he’s requested to do the same for 1000 people, 10k people, and so on. Obviously, this doesn’t scale. However, with the second approach, the workload is constant; doesn’t matter much if you ask 100 or 10k people to provide the data, rest is automated. So, I guess the main argument here is to be able to deal with large amounts of data.

    I’m in the DI2 [1] unit at DERI and we aim to deal with TB++ data *on the Web*. voiD is, to my knowledge, the only proposal on the table so far that allows one to describe not only the content (from a 50km POV) but also how the data is interlinked with other data. I’m happy to learn that there are alternatives. Do you know others?


    [1] http://semanticweb.org/wiki/DI2

    Posted by woddiscovery | 2009-03-01, 10:58
  2. i am a complete voiD newbie so maybe this question already has come up somewhere else, but i couldn’t easily find an answer. the question comes in two parts:

    – why is voiD a schema and not a model? wouldn’t it be better to start an abstract model of how one assumes to find linked data on the web, and then propose a vocabulary for it? this way, there could be an RDF schema for the semweb community, but there also could be an XML schema for the rest of world also interested in linked data on the web.

    – what’s the relationship to XLink? it seems to me voiD is doing more or less the same, but with a less hypermedia-oriented focus. XLink also is just a vocabulary and not cleanly defined as a model, which is not all that great, but at least implicitly, both are languages for linking, and when looking at their conceptual core (apart from voiD’s RDF focus and XLink’s XML focus), what’s the difference, what’s the overlap?


    Posted by dret | 2009-06-11, 22:34
  3. and, btw, OAI-ORE (http://www.openarchives.org/ore/) is another approach that does the same thing. not all that well-conceived, IMHO, but i have missed the latest updates and maybe it’s better now. and it might even be something for you because they are very RDFish, too. here are the things i wrote about it when the first draft became available:



    and there was some discussion on the o’reilly xml list (formerly xml.com) when it still was more than today’s one-man-show…

    Posted by dret | 2009-06-11, 23:13
