By Christo Lute, Director of Advanced Analytics,
Last week I used a simple metaphor to describe how different circumstances with data require different types of solutions. I said that a smaller database (at “bathtub scale”) and a large database or datahub (at “lake scale”) require us to use very different strategies in “moving the water,” or creating data solutions for business objectives. In the context of the metaphor, this seems obvious: you can’t move a bathtub of water and a lake of water the same way. But in terms of data, concretely, why would a large datahub not be usable in the same way that a small database is usable?
The answer lies in what makes Big Data Big, exactly. There’s a myth about Big Data that says it has properties smaller data sets do not. But that’s not quite right. When data reaches a large scale, nothing new happens and no magical qualities emerge from the void. Instead, three qualities that were already present, yet manageable, at the small scale become more problematic. These three qualities are known as the 3 Vs of Big Data: Volume, Variety, and Velocity.
With a smaller database (the bathtub), we can easily scoop up the data and move it to another bin. The scale of the bucket (hardware, software, or networking infrastructure) is much closer to the size of a whole bathtub. In real-life terms, this could be as simple as saying that we can fit a significant output of information into a single spreadsheet report. With a datahub, the bucket and the body of water are no longer similar in volume. It would take many orders of magnitude more spreadsheets to report on the whole of the “lake.” Volume is an important quality because the gap between the bucket and the body of water, measured in orders of magnitude, is far greater in the datahub scenario.
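To put rough numbers on the spreadsheet comparison, here is a minimal sketch. The two row counts are hypothetical, chosen only for illustration, and `sheets_needed` is a helper invented for this example. The only fixed fact is Excel's documented limit of 1,048,576 rows per worksheet.

```python
# Excel's documented per-worksheet row limit (2**20 rows).
ROWS_PER_SHEET = 1_048_576


def sheets_needed(total_rows: int, rows_per_sheet: int = ROWS_PER_SHEET) -> int:
    """Return how many full-size worksheets are needed to hold total_rows rows."""
    # Integer ceiling division: round up without going through floats.
    return -(-total_rows // rows_per_sheet)


# Hypothetical sizes, assumed purely for illustration.
bathtub_rows = 50_000          # a small database
lake_rows = 5_000_000_000      # a large datahub

print(sheets_needed(bathtub_rows))  # 1
print(sheets_needed(lake_rows))     # 4769
```

Under these assumed sizes, the bathtub fits in a single worksheet while the lake needs thousands of them, which is the gap in scale between the bucket and the body of water.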
Sometimes volume is not an issue. Perhaps you’ve created a special report that can hold sub-reports and allows you to capture the whole volume. But the variety of the data can make a big difference. In a small database, you may have 100 attributes you want to report on over 100 objects. In a datahub, it may be millions of attributes, sometimes inconsistent or in conflict with each other, over millions of entries. How do you serve up these differences? In our metaphor, bathwater is all water from the tap, but in the lake scenario it could come from rain, rivers, streams, and underground springs. When you scoop a bucket of water out of the lake, how do you know where those water molecules most recently came from? This is the variety problem: how do you deal with this variety of sources? It’s worse still if the water types have different temperatures, mineral deposits, or cleanliness levels. The same is true of data that comes from different database types.
Finally, the problem of velocity plays a huge role. In the bathtub example, the velocity of water to be bailed out is at maximum the rate the tap can pour. The tap can even be turned off or only used occasionally. In the case of the lake, the water flows in constantly, sometimes at high speed via a river, sometimes erratically and uncontainably via rain. With data, the speed at which the data must be accessed is one problem, and the speed at which the data is collected can be another. It’s completely conceivable for a business to capture trillions of transactions a day, every single day. In the water analogy, this would be akin to a constant torrential downpour.
The myth of Big Data is that something scary happens when we have a lot of data in one place, and strange and dangerous problems arise out of this Big Data. But that’s false. All data have volume, velocity, and variety—these are fundamental qualities of data. It’s simply that when we collect that data in one place, when we collect a lot of it, at high speed and with unpredictable variation, we find that those fundamental qualities pose unusual barriers we don’t typically struggle with at smaller scales.
The three Vs are not to be feared any more than we should fear the dark. These problems can be solved with the right strategy, the right insight into the new landscapes that Big Data creates, and the right vision for how Big Data can be harnessed.