
Synchronising dataspaces at scale

So, I have a question for you – how would you approach the following (engineering) problem? Imagine you have two dataspaces: a source dataspace, such as Eurostat, with some 5000+ datasets that can take up to several GB in the worst case, and a target dataspace (for example, something like what we’re currently deploying in the LATC). You want to ensure that the data in the target dataspace is as fresh as possible, that is, that there is minimal temporal delay between the contents of the source and target dataspaces.

Don’t get me wrong, this has exactly nothing to do with Linked Data, RDF or the like. This is simply the question of how often one should ‘sample’ the source in order to make sure that the target is ‘always’ up-to-date.

Now, would you say that Shannon’s sampling theorem is of any help? Or would you look at the source’s known update frequency and decide, based on that, how often to hammer the server?
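Taken at face value, the sampling-theorem analogy translates into a one-liner: poll at least twice as often as the source publishes. A minimal sketch (the function is mine, purely illustrative; the twice-a-day figure comes up again in the comments below):

```python
def min_poll_interval_hours(source_updates_per_day: float) -> float:
    """Nyquist-style reading: sample at >= 2x the source's update frequency."""
    update_interval = 24.0 / source_updates_per_day
    return update_interval / 2.0

print(min_poll_interval_hours(2))  # → 6.0: twice-daily updates, poll every 6 hours
```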

Step back.

It turns out that one should also take into account what happens in the target dataspace. In our case this is mainly the conversion of the XML or TSV into some RDF serialisation. In cases where a source dataset weighs in at, say, some 11GB, this is a non-trivial issue to address. In addition, we see some ~1000 datasets changing within a couple of days’ time. Which would leave us, in the worst case, still in the middle of converting parts of the dataspace while newer versions of the same datasets were already pending.

On the other hand, we know from our experience with the Eurostat data that we can rebuild the entire dataspace – that is, download all 5000+ files incl. metadata, convert them to RDF and load the metadata into the SPARQL endpoint – in some 11+ days. Wouldn’t it make sense, then, to simply check for updates only every 10 or so days?
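The back-of-the-envelope arithmetic behind that conclusion can be made explicit. A minimal sketch using only the figures quoted above (the variable names are mine):

```python
# Rough capacity check: can the pipeline keep up with the change rate?
# Figures from the post: full rebuild of 5000+ datasets takes ~11 days,
# and ~1000 datasets change within ~2 days.

FULL_REBUILD_DAYS = 11
TOTAL_DATASETS = 5000
CHANGED_DATASETS = 1000
CHANGE_WINDOW_DAYS = 2

pipeline_throughput = TOTAL_DATASETS / FULL_REBUILD_DAYS  # ~455 datasets/day
change_rate = CHANGED_DATASETS / CHANGE_WINDOW_DAYS       # 500 datasets/day

# If changes arrive faster than conversions finish, a continuous
# poll-and-convert loop only accumulates backlog.
print(f"throughput ≈ {pipeline_throughput:.0f}/day, change rate = {change_rate:.0f}/day")
print("keeps up" if pipeline_throughput >= change_rate else "falls behind")
# → falls behind
```

Since the worst-case change rate slightly exceeds what a continuous poll-and-convert loop could sustain, batching the work into a periodic full pass is the safer policy.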

We discussed this today and settled on a weekly (weekend) update policy. Let’s see where this takes us – I promise to keep you posted …


About mhausenblas

Chief Data Engineer EMEA @MapR #bigdata #hadoop #apachedrill

Discussion

10 thoughts on “Synchronising dataspaces at scale”

  1. We do this, admittedly with a smaller number of datasets, by polling for changes quite frequently. Every five minutes, usually. We transfer only the resources that have changed, using the SDshare protocol. The RDF part of that protocol is not published, but we’re working on it. This works very well for us.

    Posted by Lars Marius Garshol | 2012-02-14, 08:50
  2. Lars-Marius,

    Thanks a million for the hint on SDshare – you’re talking about http://www.egovpt.org/fg/CWA_Part_1b, right?

    So, to be a bit more precise about our setup: we know exactly when they do updates (twice a day) and, via the ToC [1], we can determine which datasets have changed (they are timestamped with latest-updated).

    However, determining the sweet spot between how often to download the datasets from the source and how often to run the conversion is something yet to be done. That’s the tricky part, really – put these numbers together and you’ll see that converting an 11GB SDMX/XML file to RDF doesn’t happen instantaneously …

    Cheers,
    Michael

    [1] http://epp.eurostat.ec.europa.eu/NavTree_prod/everybody/BulkDownloadListing?sort=1&file=table_of_contents.xml

    Posted by mhausenblas | 2012-02-14, 09:02
  3. Yes, that’s the spec I’m talking about. We’re working on a new version to be published elsewhere.

    Ah, I see. If this is the level of access you have, then it gets tricky. We have direct SQL access to the underlying dataset, and can therefore expose SDshare feeds with only the actual changes. In your case you’d have one SDshare collection per dataset, with a fragment feed and a snapshot feed for each. But, again, if this is the only access you’ve got, then …

    Posted by Lars Marius Garshol | 2012-02-14, 09:08
  4. …just thought of: measure the change frequency per dataset in the source and apply an individual refresh procedure – that is, update frequently changing datasets quickly while reducing transfer & traffic for slowly changing ones.

    (given that an update to one dataset can be isolated as a single task)

    You might also use this kind of handling on the receiving side: you could decide on update frequency based on usage frequency. (The more often a dataset is used, the more current its information should be.)

    Posted by Daniel Koller (@dakoller) | 2012-02-14, 11:28
  5. This is exactly the problem that the “RDF Pipeline” approach that I’ve been describing addresses. Here are my slides from the last Semantic Technology Conference in San Francisco:

    http://dbooth.org/2011/pipeline/

    It is a distributed, decentralized approach for data production pipelines and provides automatic caching and updating according to specifiable update policies. Although it was inspired by RDF needs, it is completely data agnostic.

    I have started developing this as an open source project on google code:

    http://code.google.com/p/rdf-pipeline/

    Thus far it is only at the proof-of-concept stage, but I hope to get it ready for early production use soon. I will be speaking again about it at the upcoming Semantic Technology Conference in San Francisco:

    http://semtechbizsf2012.semanticweb.com/

    Posted by David Booth | 2012-02-16, 15:06
  6. The Correlation Technology Platform (www.correlationconcepts.com) can easily accommodate:

    a) the size of the information store you are talking about (dozens of gigs)
    b) different file formats and locations
    c) frequent updates

    as well as, the things you did not ask about

    d) the ability to drive the correlation process to closure, each and every time the infobase (the collection of knowledge fragments, decomposed from all source stores) is processed by a user query
    e) response times matched to your requirements, and
    f) support for as many users, concurrently, as you wish

    On the web site there is a form you fill out – and we tell you what hardware provisioning needs to take place in order to get you what you want.

    Carl Wimmer

    Posted by Carl Wimmer | 2012-02-16, 15:14
  7. What about doing usage-driven updates? One could monitor the usage being made of the data in the target data-space and base the updates on this. This requires one full update to load the target data space with all the data that can be provided, followed by as many updates as needed to keep the data that is actually used fresh. Nobody will notice that the rest of the data is outdated because nobody will use it ;-)

    Posted by Christophe Guéret | 2012-02-20, 07:25
    • Christophe:

      You have a number of problems (all of which can be solved).

      1) you have diverse stores of information – each no doubt in its own file format

      2) size is no issue – that is “just hardware” once you have the right system

      3) once you get it all in one place and “in one space” (meaning it all looks the same), all you need is a process for running the new base (infobase) against queries – which you certainly want to be N-dimensional

      4) you can either drop off “old data” at set intervals, via a systems layer, or you can simply keep it because … “you never know”.

      5) and you want answer sets that are exhaustive every time you run the process, so you can know, absolutely know, that there is nothing to find that has not already been found.

      so … you look at the web site below, and read about Correlation.

      and you have the answer for how to do it today.

      The entire system is, of course, protected by patents issued and pending in Europe, so if you are going to use this approach, please deal with us.

      Carl Wimmer

      Posted by Carl Wimmer | 2012-02-20, 15:58
  8. Carl: I’m sorry my remark triggered such a feeling of being attacked. No worries, there is no need to send your lawyers after me, I was just willing to contribute to the discussion that was started here and that I found interesting. Maybe your system is the perfect answer to the question raised, maybe it isn’t, I must admit that the website you point to is so unreadable that I just didn’t read it.

    Christophe Guéret

    Posted by Christophe Guéret | 2012-02-20, 23:42
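Daniel Koller’s suggestion above – refresh each dataset at a rate matched to its observed change frequency – could be sketched roughly as follows. The class, its bounds, and all names are illustrative, not part of any system discussed in the thread:

```python
from dataclasses import dataclass, field

@dataclass
class DatasetTracker:
    """Track observed update times and derive a per-dataset poll interval."""
    update_times: list = field(default_factory=list)
    min_interval: float = 3600.0        # never poll more often than hourly
    max_interval: float = 7 * 86400.0   # never wait longer than a week

    def record_change(self, ts: float) -> None:
        self.update_times.append(ts)

    def poll_interval(self) -> float:
        if len(self.update_times) < 2:
            return self.max_interval  # no history yet: be conservative
        gaps = [b - a for a, b in zip(self.update_times, self.update_times[1:])]
        mean_gap = sum(gaps) / len(gaps)
        # Poll at twice the observed change frequency, clamped to sane bounds.
        return max(self.min_interval, min(self.max_interval, mean_gap / 2))
```

A dataset observed to change daily would then be polled every 12 hours, while one with no observed changes stays on the weekly schedule – which matches the batching policy the post settles on.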
