Denormalizing graph-shaped data

As nicely pointed out by Ilya Katsov:

Denormalization can be defined as the copying of the same data into multiple documents or tables in order to simplify/optimize query processing or to fit the user’s data into a particular data model.

So I was wondering: why, in Ilya's write-up, is denormalization not considered applicable to GraphDBs?

I suppose the main reason is that the relationships (or links, as we usually call them in the Linked Data world) are typically not resolved or dereferenced. That keeps graph traversal fast, but for a number of operations, such as range queries, denormalized data would serve better.

Now, the question is: can we achieve this in GraphDBs, incl. RDF stores? I would hope so. Here are some design ideas:

  • Up-front: when inserting new data items (nodes), immediately dereference the links and embed the referenced values (embedded links); see the sketch below.
  • Query-time: apply database cracking, i.e., let the queries themselves incrementally reorganize and index the data (also sketched below).
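
To make the up-front idea a bit more concrete, here is a minimal sketch in Python using rdflib; the ex: vocabulary and the insert_person helper are made up for illustration, not taken from any real dataset. When a new node is inserted, the value we later want to range-query over is copied from the linked node onto the new node, so the query no longer has to traverse the link:

    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import RDF, XSD

    EX = Namespace("http://example.org/")  # hypothetical vocabulary
    g = Graph()

    # Normalized part of the graph: a company node with its founding year.
    g.add((EX.acme, RDF.type, EX.Company))
    g.add((EX.acme, EX.foundedYear, Literal(1984, datatype=XSD.integer)))

    def insert_person(person, employer):
        """Insert a person node and, up-front, copy (denormalize) the
        employer's founding year onto the person node."""
        g.add((person, RDF.type, EX.Person))
        g.add((person, EX.worksFor, employer))
        for year in g.objects(employer, EX.foundedYear):
            g.add((person, EX.employerFoundedYear, year))

    insert_person(EX.alice, EX.acme)

    # The range query now touches only person nodes; no link traversal needed.
    for row in g.query("""
        PREFIX ex: <http://example.org/>
        SELECT ?p WHERE { ?p ex:employerFoundedYear ?y . FILTER (?y < 2000) }
    """):
        print(row.p)

And a toy illustration of the query-time idea: in database cracking, a copy of the queried column is partitioned a little further by every range query it answers, so frequently queried ranges get cheaper over time. This is a simplified in-memory sketch of the principle, not how a real store implements it:

    import bisect

    class CrackerColumn:
        """A toy cracker column: the data is physically reorganized
        (partitioned) as a side effect of the range queries it answers."""

        def __init__(self, values):
            self.data = list(values)  # cracked copy of the base column
            self.index = []           # sorted (pivot value, position) pairs

        def _crack(self, pivot):
            """Partition the piece that may still straddle `pivot` and
            remember the split position."""
            keys = [v for v, _ in self.index]
            i = bisect.bisect_left(keys, pivot)
            lo = self.index[i - 1][1] if i > 0 else 0
            hi = self.index[i][1] if i < len(self.index) else len(self.data)
            piece = self.data[lo:hi]
            left = [x for x in piece if x < pivot]
            right = [x for x in piece if x >= pivot]
            self.data[lo:hi] = left + right
            bisect.insort(self.index, (pivot, lo + len(left)))
            return lo + len(left)

        def range_query(self, low, high):
            """Return all values in [low, high); answering the query cracks
            the column, making future queries over this range cheap."""
            start = self._crack(low)
            end = self._crack(high)
            return self.data[start:end]

    col = CrackerColumn([5, 1, 9, 3, 7])
    print(col.range_query(3, 8))  # [5, 3, 7] (order within the piece is arbitrary)

The obvious cost, in both cases, is the one Ilya points out: the same data now lives in more than one place and has to be kept in sync on updates.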

Here is the question for you, dear reader: are you aware of people doing this? My Google skills have failed me so far; I'd be happy to learn about it in greater detail!


About mhausenblas

Chief Data Engineer EMEA @MapR #bigdata #hadoop #apachedrill

Discussion

4 thoughts on “Denormalizing graph-shaped data”

  1. This sounds to me more like combining RDBMSs* with GDBMSs (especially triple stores), which looks like a really reliable and applicable architecture today.

    *) you could maybe also replace the RDBMSs with appropriate NoSQL systems, e.g., document databases (again, the BBC is doing it this way, right?)

    Posted by zazi0815 | 2012-09-20, 19:53
  2. While we don’t do this systematically, we often do it for specific applications. To simplify analytics, we’ll often use rules to denormalize selected parts of the graph as data is inserted. Because of the flexibility of the graph data model, we usually don’t worry about the data living in two places, which is nice when some users are performing analytics that require a certain level of performance while others are walking the data by following links within it.

    Posted by Lee Feigenbaum | 2012-09-23, 12:17
  3. I can’t remember any specifics, but I’ve certainly seen it a few times. It’s kind of like with relational databases: you see that, for certain queries that will be coming up a lot, the levels of indirection necessary to satisfy these queries (JOINs for SQL, hopping multiple triples in RDF) are slowing things down, so you copy data somewhere else to reduce the number of hops. In triples, if you need to look up d from a and {a x b}, {b y c}, and {c z d}, you do a lot of the lookups in a batch and then create new {a v d} triples to make the a -> d lookups go more quickly. Of course, you have update issues to think about (they were normalized for a reason!), but ultimately, it’s about thinking ahead to help out the queries that will happen often.

    Posted by bobdc | 2012-09-23, 12:54
  4. We use a similar technique in our system; in this case, the objective was to reduce the latency of some queries. As a testbed we used the English DBpedia: http://dbpedia.oobian.com/

    Posted by Miguel Grade | 2012-10-23, 15:24
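
The shortcut-triple idea from the third comment above can be sketched in a few lines of Python with rdflib as well; the predicates ex:x, ex:y, ex:z and the derived predicate ex:v are placeholders that mirror the {a x b}, {b y c}, {c z d} pattern in the comment rather than any real vocabulary:

    from rdflib import Graph, Namespace

    EX = Namespace("http://example.org/")  # placeholder vocabulary
    g = Graph()

    # A normalized three-hop chain: a -x-> b -y-> c -z-> d
    g.add((EX.a, EX.x, EX.b))
    g.add((EX.b, EX.y, EX.c))
    g.add((EX.c, EX.z, EX.d))

    def materialize_shortcuts(graph):
        """Batch-resolve the three-hop pattern and add direct {a v d}
        shortcut triples so the frequent a -> d lookup becomes one hop."""
        hits = graph.query("""
            PREFIX ex: <http://example.org/>
            SELECT ?a ?d
            WHERE { ?a ex:x ?b . ?b ex:y ?c . ?c ex:z ?d }
        """)
        for row in hits:
            graph.add((row.a, EX.v, row.d))

    materialize_shortcuts(g)

    # One hop instead of three:
    print(list(g.objects(EX.a, EX.v)))

As the comment notes, the shortcut triples then have to be kept in sync whenever the underlying chain changes.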
