The value of large-scale datasets – stemming from IoT sensors, end-user and business transactions, social networks, search engine logs, etc. – apparently lies in the patterns buried deep inside them. Being able to identify these patterns, analyzing them is vital. Be it for detecting fraud, determining a new customer segment or predicting a trend. As we’re moving from the billions to trillions of records (or: from the terabyte to peta- and exabyte scale) the more ‘traditional’ methods, including MapReduce seem to have reached the end of their capabilities. The question is: what now?
But a second issue has to be addressed as well: in contrast to what current large-scale data processing solutions provide for in batch-mode (arbitrarily but in line with the state-of-the-art defined as any query that takes longer than 10 sec to execute) the need for interactive analysis increases. Complementary, visual analytics may or may not be helpful but come with their own set of challenges.
Recently, a proposal for a new Apache Incubator group called Drill has been made. This group aims at building a:
… distributed system for interactive analysis of large-scale datasets […] It is a design goal to scale to 10,000 servers or more and to be able to process petabyes of data and trillions of records in seconds.
Drill’s design is supposed to be informed by Google’s Dremel and wants to efficiently process nested data (think: Protocol Buffers). You can learn more about requirements and design considerations from Tomer Shiran’s slide set.
In order to better understand where Drill fits in in the overall picture, have a look at the following (admittedly naïve) plot that tries to place it in relation to well-known and deployed data processing systems: