
Ye shall not DELETE data!

This is the first post in the solving-tomorrow’s-problems-with-yesterday’s-tools series.

Alex Popescu recently reviewed a post by Mikayel Vardanyan on Picking the Right NoSQL Database Tool and was puzzled by the following statement of Mikayel’s:

[Relational database systems] allow versioning or activities like: Create, Read, Update and Delete. For databases, updates should never be allowed, because they destroy information. Rather, when data changes, the database should just add another record and note duly the previous value for that record.

I don’t find it puzzling at all. As Pat Helland rightly says:

In large-scale systems, you don’t update data, you add new data or create a new version.

OK, I guess arguing this on an abstract level serves nobody. Let’s get our hands dirty and have a look at a concrete example. I’ll pick one from the Linked Data world, but there is nothing really Linked-Data-specific about the argument – it just happens to be the data language I speak and dream in ;)

Look at the following piece of data:
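A minimal Turtle sketch, with the resource IRI, the ex: vocabulary and the address literal made up for illustration:

    @prefix foaf: <http://xmlns.com/foaf/0.1/> .
    @prefix ex:   <http://example.org/vocab#> .

    # a made-up resource standing in for me, with a plain address literal
    <http://example.org/people/michael> a foaf:Person ;
        foaf:name  "Michael" ;
        ex:address "Galway, Ireland" .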

… and now let’s capture the fact that my address has changed …
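The conventional way to capture this is an in-place update. A sketch in SPARQL 1.1 Update, reusing the made-up names from above (the new address is fictitious as well):

    PREFIX ex: <http://example.org/vocab#>

    # overwrite the old address with the new one – the old value is gone afterwards
    DELETE { <http://example.org/people/michael> ex:address ?old }
    INSERT { <http://example.org/people/michael> ex:address "Berlin, Germany" }
    WHERE  { <http://example.org/people/michael> ex:address ?old }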

This looks normal at first sight, but there are two drawbacks attached to it:

  1. If I ask the question: ‘Where has Michael been living previously?’, I can’t get an answer anymore once the update has been performed, unless I have a local copy of the old data piece.
  2. Whenever I ask the question: ‘Where does Michael live?’ I need to implicitly add ‘at the moment’, as the information is not scoped.

There are a few ways one can deal with this, though. As a consequence, here is what I demand:

  • Never ever DELETE data – it’s slow and lossy; also updating data is not good, as UPDATE is essentially DELETE + INSERT and hence lossy as well.
  • Each piece of data must be versioned – in the Linked Data world one could, for example, use quads rather than triples to capture the context of the assertion expressed in the data (see the sketch below).
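Here is a TriG sketch of the second point – the graph names, the ex: vocabulary and the dates are all made up for illustration:

    @prefix ex:  <http://example.org/vocab#> .
    @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

    # each assertion lives in a named graph; the graph is the unit of versioning
    <http://example.org/graphs/address-v1> {
        <http://example.org/people/michael> ex:address "Galway, Ireland" .
    }
    <http://example.org/graphs/address-v2> {
        <http://example.org/people/michael> ex:address "Berlin, Germany" .
    }

    # the context of each version, stated in the default graph
    <http://example.org/graphs/address-v1> ex:validUntil "2011-05-28"^^xsd:date .
    <http://example.org/graphs/address-v2> ex:validFrom  "2011-05-29"^^xsd:date .

With this in place, both questions from above can be answered: the previous address is still there, and each address is explicitly scoped in time.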

Oh, BTW, my dear colleagues from the SPARQL Working Group – having said this, I originally wrote that SPARQL Update is heading in the wrong direction; on reflection, I rather think it would benefit from adding an appendix that discusses ‘large-scale deployment considerations’ at a system level. Can we still change this, pretty please?

PS: disk space is cheap these days, as nicely pointed out by Dorian Taylor ;)


About mhausenblas

Chief Data Engineer EMEA @MapR #bigdata #hadoop #apachedrill

Discussion

12 thoughts on “Ye shall not DELETE data!”

  1. Hi Michael,

    I don’t see any incompatibilities between SPARQL Update 1.1 and versioning. Before deleting your statements (a feature that people asked for), you can copy them into a versioned graph. That’s one of the motivations for MOVE (one of the features we’re looking for feedback on, so yours would be welcome!)

    E.g.

    MOVE <mh-location> TO <mh-location-v1>

    INSERT INTO <mh-location> { some_new_statements }

    And in case you move again:

    MOVE <mh-location> TO <mh-location-v2>

    INSERT INTO <mh-location> { some_new_statements }

    You can thus keep your location history in these mh-location-vx graphs, and can also add additional statements (e.g. with dc:created) to indicate the creation date of the graph. Then, obviously, it depends on whether you deal with a triple store or a quad store, but my guess is that most of them now support quads.
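    For instance, a sketch – assuming the <mh-location-v1> graph name from above and a made-up date:

    PREFIX dc:  <http://purl.org/dc/terms/>
    PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

    # record when the versioned graph was created
    INSERT DATA { <mh-location-v1> dc:created "2011-05-29"^^xsd:date }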

    HTH

    NB: In case you have formal requirements / objections, we’re in LC, so please e-mail the list.

    Posted by Alex. | 2011-05-29, 10:05
    • Alex,

      Thanks a lot for the speedy reply, very much appreciated! Maybe I didn’t make it as explicit as it should have been in the first place – sorry, I thought this was obvious, my bad …

      So, I agree with you that SPARQL Update and versioning are compatible, but that’s not the point. What I really meant is: it’s slow and unnecessary to delete data.

      Compare with the example I gave: in the current setup a SPARQL engine roughly has to do the following: i) find the node to be updated (OK, that’s rather cheap in graph databases such as Neo4j, but sharding the graph is still subject to research, AFAIK), ii) remove the ‘old’ link(s) and node(s), iii) insert the new link(s) and node(s), and iv) potentially update some indices.

      Contrast this with the single operation of simply adding the new link(s) and node(s) along with the context (such as the time-range where the data piece is valid, in my example).
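      A rough sketch of that single append, reusing the made-up graph name, vocabulary and dates from the examples in the post:

      PREFIX ex:  <http://example.org/vocab#>
      PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

      # one write: the new statement goes into a fresh named graph, together with its validity context
      INSERT DATA {
        GRAPH <http://example.org/graphs/address-v2> {
          <http://example.org/people/michael> ex:address "Berlin, Germany" .
        }
        <http://example.org/graphs/address-v2> ex:validFrom "2011-05-29"^^xsd:date .
      }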

      Beside the fact that DELETE (and UPDATE, as pointed out in the post) is lossy, it’s slow on a large scale. This is my entire point.

      I guess, given the status of the WD, it would probably cause a lot of pain if I issued a formal objection, but one could think of writing an additional WG Note that doesn’t go REC Track – sort of implementation or usage advice. If you think this makes sense, I’m happy to join the SPARQL WG for this activity – let me know ;)

      Cheers,
      Michael

      Posted by mhausenblas | 2011-05-29, 10:23
  2. Agree – versioning, and in many cases the management of both alternatives and versions, is absolutely key to solving today’s and tomorrow’s problems! I have found the facade pattern, one of the software engineering design patterns, highly useful to address this issue. We applied it on top of the metadata registry model (ISO 11179) for clinical data (see http://www.slideshare.net/kerfors/designing-and-launching-the-clinical-reference-library). The current implementation uses a traditional relational database approach. I would be very interested to learn how this software pattern can be applied for both data and metadata in the Linked Data world :-)

    Posted by kerfors | 2011-05-29, 10:38
  3. Hi,

    I completely agree with your point: data should never be deleted. For example, in SAP accounting, payments are reversed by posting another payment reversing the first one (with a reference to the initial payment).

    I think that a distinction should be made between
    - the versioning of data
    - the effectivity of data

    The versioning of data can be handled with a timestamp added to every RDF resource. In my opinion, this timestamp should be technical and not appear as a new triple or in a quad. However, query languages should be able to deal with it.

    The effectivity of data can be handled with a date or date/time. In my opinion, this information could appear as a quad or a new triple.

    In your example, when you change your address, the new triple is not created while doing the move. It is created either in advance (the day before, in anticipation) or the day after, once the move is done and the computer is reconnected to the internet. Depending on when the RDF triple is created the timestamp will be different, but the effective date will be the same.
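    To illustrate the distinction with a minimal sketch (the ex: vocabulary and the dates are made up):

    @prefix ex:  <http://example.org/vocab#> .
    @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

    # effectivity: ordinary data, visible to queries
    <http://example.org/people/michael>
        ex:addressEffectiveFrom "2011-06-01"^^xsd:date .

    # versioning: the timestamp of when this triple was actually recorded
    # (the day before or the day after the move) stays technical metadata
    # inside the store and is not asserted as a triple or quad.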

    Further reading: http://martinfowler.com/eaaDev/Effectivity.html

    Best regards, Manuki-san

    Posted by Manuki-san | 2011-05-29, 10:45
  4. Hi Michael,

    In my experience, you’re conflating a system-level view of data with an application-level view of data, and the two ways of looking at data often have very different needs.

    Versioning of data & never removing nodes for performance reasons are often system-level concerns. Deleting data (or updating data) is a very common application-level operation. The two are not at all mutually exclusive.

    In Anzo, users (or applications) are free to delete, update, or do whatever they want with massive amounts of data. In the implementation (i.e. at the system-level), actual data is not removed from disk, from indexes, etc. The data is versioned and deleted data is marked as no longer valid, and new data is added to the underlying journal files.

    In this context, SPARQL 1.1 Update is a very useful tool for the application developer, and I don’t think I really see a motivation for discouraging its use.

    Lee

    Posted by Lee Feigenbaum | 2011-05-29, 17:14
    • Lee,

      Thanks a bunch for your comment! Indeed, you’re right concerning the system- vs. application-level POV. I guess this is why I suggested calling the respective advice ‘implementation/usage advice for large-scale deployments’ rather than suggesting to change the update language itself. If this was not clear up-front, I hope this clarifies it now, yes?

      Cheers,
      Michael

      Posted by mhausenblas | 2011-05-29, 17:26
    • Michael,

      I still don’t understand what the advice would be. Perhaps you can give an example of what the advice would say? This seems to me to be no different than giving advice to people writing query engines for large scale relational databases that says “write your optimizer to prefer index lookups rather than table scans” — that is, it’s true, but it doesn’t seem like it’s the business of any sort of standards group to have anything to do with giving implementers advice on pretty standard fare…

      Posted by Lee Feigenbaum | 2011-05-29, 17:32
    • Lee,

      I don’t have the advice handy (yet) but I’m relieved to learn that this is ‘standard fare’. I’m looking for experiences of implementing SPARQL Update on a large cluster (say 100+ machines) dealing with high throughput and data at the billions-of-triples scale. Can you provide me with some concrete examples, please?

      Cheers,
      Michael

      Posted by mhausenblas | 2011-05-29, 17:50
    • Hi Michael,

      No, I don’t have any specific examples. But just as SQL isn’t the tool of choice across Hadoop clusters or what-not, I don’t see why the lack of existing 100+ node clusters running SPARQL Update is at all relevant to the utility or appropriateness of the SPARQL Update language. Use whatever tool is appropriate for the job, but if we felt obligated to go around defining the parameters under which it makes sense to implement each particular standard, I don’t think we’d ever get anything else done :-)

      Lee

      Posted by Lee Feigenbaum | 2011-05-30, 04:03
  5. Hi Michael,

    Well, there are various use cases for SPARQL 1.1 Update: “you never delete data” as you describe it is one of them, updates in the spirit of traditional relational DBs another, and SPARQL 1.1 Update caters for both of them.

    This said, I fail to see what you mean by “SPARQL Update is heading in the wrong direction”.

    If you are up for defining a “no deletions” fragment of the update language, that might be interesting for people solely interested in this use case, but I am not convinced that is in scope of the current SPARQL WG’s charter/goal.

    The way I see it, we want to get the full update language out now; specialised “fragments” of the language for particular use cases can evolve based on this standard.

    Posted by Axel | 2011-05-29, 20:54
  6. Axel, Lee,

    I acknowledge the fact that if one wants to achieve something, one should probably not suggest that something is broken but rather propose a concrete change in a constructive way ;)

    I have hence updated the respective section in my original blog post above – I hope this clarifies my intention.

    To sum up, I propose the following: add an appendix ‘large-scale deployment considerations’ (from a system-level POV) to the SPARQL Update document to discuss performance and scalability issues concerning large-scale deployments (hundreds of nodes/tera-triples scale) with the intention to future-proof the spec a bit more.

    If the WG decides to go for this, I repeat my offer to contribute.

    Cheers,
    Michael

    Posted by mhausenblas | 2011-05-30, 08:31

Trackbacks/Pingbacks

  1. Pingback: Distributed Weekly 105 — Scott Banwart's Blog - 2011-06-03
