
Interactive analysis of large-scale datasets

The value of large-scale datasets (stemming from IoT sensors, end-user and business transactions, social networks, search engine logs, etc.) lies in the patterns buried deep inside them. Being able to identify and analyze these patterns is vital, be it for detecting fraud, determining a new customer segment or predicting a trend. As we’re moving from billions to trillions of records (or: from the terabyte to the peta- and exabyte scale), the more ‘traditional’ methods, including MapReduce, seem to have reached the end of their capabilities. The question is: what now?

But a second issue has to be addressed as well: the need for interactive analysis is increasing, in contrast to what current large-scale data processing solutions provide for in batch mode. Here, batch mode is defined (arbitrarily, but in line with the state of the art) as any query that takes longer than 10 seconds to execute. Complementary to this, visual analytics may or may not be helpful, but it comes with its own set of challenges.

Recently, a proposal for a new Apache Incubator group called Drill has been made. This group aims at building a:

… distributed system for interactive analysis of large-scale datasets […] It is a design goal to scale to 10,000 servers or more and to be able to process petabytes of data and trillions of records in seconds.

Drill’s design is meant to be informed by Google’s Dremel, and it aims to efficiently process nested data (think: Protocol Buffers). You can learn more about requirements and design considerations from Tomer Shiran’s slide set.
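To make "nested data" a bit more concrete, here is a minimal Python sketch. It is not Drill or Dremel code: the sample records, the field names and the helper functions are all made up for illustration. It flattens nested records into one value list per field path, so an aggregate only has to read the columns it actually needs. Dremel's real column format additionally stores repetition and definition levels so that records can be reassembled; that part is left out here.

```python
from collections import defaultdict

# Hypothetical nested records, shaped like a Protocol Buffers message
# with a nested "name" group and a repeated "links" field.
records = [
    {"doc_id": 10, "name": {"lang": "en", "country": "us"}, "links": [20, 40, 60]},
    {"doc_id": 20, "name": {"lang": "en", "country": "gb"}, "links": [80]},
    {"doc_id": 30, "name": {"lang": "de", "country": "de"}, "links": []},
]


def flatten(value, path=""):
    """Yield (column_path, leaf_value) pairs for one nested record."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from flatten(child, f"{path}.{key}" if path else key)
    elif isinstance(value, list):
        # Repeated field: every element lands in the same column.
        for child in value:
            yield from flatten(child, path)
    else:
        yield path, value


def to_columns(rows):
    """Group leaf values by field path (a simplified columnar layout)."""
    columns = defaultdict(list)
    for row in rows:
        for column_path, leaf in flatten(row):
            columns[column_path].append(leaf)
    return dict(columns)


columns = to_columns(records)

# Each aggregate below touches a single column only.
link_count = len(columns["links"])
langs = sorted(set(columns["name.lang"]))
```

With this layout, counting all links or listing the distinct languages never reads the other fields, which is one reason column-oriented designs suit interactive, skim-style queries.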

In order to better understand where Drill fits into the overall picture, have a look at the following (admittedly naïve) plot that tries to place it in relation to well-known and deployed data processing systems:

BTW, if you want to test-drive Dremel, you can already do so today: it is offered as a hosted service in Google’s cloud computing suite, called BigQuery.

About mhausenblas

Distributed Jester, Mesosphere


7 thoughts on “Interactive analysis of large-scale datasets”

  1. Great intro post for Dremel and Drill.

    I think batch-style data processing usually provides deeper and more sophisticated processing than the “skim through / estimate” approach of Dremel-style systems. In other words, Hadoop trades interactivity for deeper, more sophisticated data processing, while Dremel trades deeper analytics for interactivity and faster response.

    Pardon my immodesty, but I would like to add a couple of my own links on the issue that I think are relevant:

    Our two-year-old, prototype-quality implementation of Dremel: http://code.google.com/p/dremel/
    The above page also has a couple of relevant blog links.

    My lengthy take on taxonomy of BigData systems: http://bigdatacraft.com/archives/135

    Hope to see more posts on the issue here…

    Posted by Camuel Gilyadov | 2012-09-02, 21:09
    • Camuel is dead-on. Dremel/Drill provide skimming summaries of big data rather than deep analysis. As such, Dremel and map-reduce make perfect partners.

      Posted by Ted Dunning | 2012-09-03, 19:26
    • Useful summary.

      Posted by viplav | 2012-10-20, 20:53


  1. Pingback: Quora - 2012-09-03

  2. Pingback: Link Roundup – September 4, 2012 | Enterprise Information Management in the 21st Century - 2012-09-04

  3. Pingback: Windows Azure and Cloud Computing Posts for 9/4/2012+ - Windows Azure Blog - 2012-09-05

  4. Pingback: MapR, Europe and me « Web of Data - 2013-01-01
