Sunday 4 March 2012

Big Data


The other day I watched the Oracle Big Data forum, now available here. It was a half-day event with various speakers on the subject of Big Data, including Tom Kyte, a mentor whom I admire!

The forum went over Oracle's approach to Big Data, which I summarise below:
  1. Acquire - Collect Big Data: identify it and work out where it is. Then store it in Oracle NoSQL - a key-value database.

  2. Organise - Stage Big Data in a transient elastic database. Using Oracle Data Integrator and the Oracle Hadoop connector, reduce and distil it.

  3. Analyse - Start analytics on the now acquired and organised (reduced/distilled) Big Data, using a variety of Oracle tools, like the R language, pattern chasing etc.

  4. Decide - Present your data to the decision makers with dashboards, load it back into a relational database, etc.
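
To make the four steps a bit more concrete, here is a toy sketch in Python. It is only an illustration under loud assumptions: a plain dict stands in for a key-value store, the sensor readings are made-up sample data, and every function name is hypothetical. None of this is Oracle's API or tooling.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical raw sensor readings as (sensor_id, value). Made-up sample data.
RAW_READINGS = [
    ("pump-1", 10.0),
    ("pump-2", 20.0),
    ("pump-1", 14.0),
    ("pump-2", 22.0),
    ("pump-1", 12.0),
]


def acquire(readings):
    """Acquire: store each reading under a key, with no schema or constraints."""
    store = {}  # a plain dict standing in for a key-value store (assumption)
    for i, (sensor_id, value) in enumerate(readings):
        store[f"{sensor_id}:{i}"] = value
    return store


def organise(store):
    """Organise: group values by sensor (map) and average them (reduce)."""
    grouped = defaultdict(list)
    for key, value in store.items():
        sensor_id = key.split(":")[0]
        grouped[sensor_id].append(value)
    return {sensor_id: mean(values) for sensor_id, values in grouped.items()}


def analyse(averages, threshold):
    """Analyse: flag the sensors whose average exceeds the threshold."""
    return sorted(s for s, avg in averages.items() if avg > threshold)


def decide(flagged):
    """Decide: turn the result into a one-line report for a decision maker."""
    if not flagged:
        return "No sensors need attention."
    return "Sensors needing attention: " + ", ".join(flagged)


def run_pipeline(readings, threshold=15.0):
    """Chain the four steps: acquire, organise, analyse, decide."""
    return decide(analyse(organise(acquire(readings)), threshold))


if __name__ == "__main__":
    print(run_pipeline(RAW_READINGS))
```

The point of the sketch is the shape of the flow: the raw data is captured without any structure, and only the distilled result (a handful of averages) reaches the analysis and the decision maker.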
Looking at the summary above, it really describes Big Data as something like a distillation of a massive amount of data. Two questions come to my mind:
  • Where is Big Data?

  • Why do we want Big Data?
Answering the second question is easy. We want Big Data because it is all about having detailed information that allows us to make better decisions. It is the classical "Your Boss's Decision Making" use of data, if you work in business. If you work in astronomy, the question will probably be: "Is there life in space?"

The first question, "Where is Big Data?", is the one which I think will probably take more time and effort to answer. Or maybe it would be more correct to slightly rewrite the question as "Where can I find Big Data useful for my business?", as Big Data must make money, too. In summary, this was what Oracle said, or at least what I understood from it. Very interesting indeed. Oracle looks at Big Data as another data source, which is cool too.

Below are my ramblings on the topic of Big Data.

Let me start by saying that there were many interesting examples used to describe Big Data in the Oracle forum. For some, Big Data is things such as the collected heartbeats of patients. For others, it is data collected from petroleum pipe sensors on oil rigs. Other descriptions say that Big Data is sensor data accumulated during flights (apparently 7TB on a flight between London and New York), that Big Data is Machine Data, or even that Big Data is Dark Matter!

Whatever we call it, Big Data is massive! A beast which is usually truncated due to lack of space in a relational database - the rolling window. A beast which we cannot collect in a relational database. A beast which has to be written quickly, as it is generated too fast for a relational database table to keep up, or the table's schema is too constraining. A thing which cannot be handled by the I/O of a single system kit, no matter how expensive and grand that system kit is. It seems Big Data can only live in a cluster of system kits spread far and wide, from Australia to the North Pole?
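
The difference between truncating with a rolling window and spreading the data out can be shown with a small Python sketch. Everything here is an assumption made for illustration: the window size, the number of nodes and the function names are hypothetical, and plain lists stand in for the "system kits" of a cluster.

```python
from collections import deque

# Hypothetical capacity: the single "table" keeps only the 3 newest readings.
WINDOW_SIZE = 3


def rolling_window(readings, size=WINDOW_SIZE):
    """Keep only the newest `size` readings; older ones are silently dropped."""
    window = deque(maxlen=size)
    for reading in readings:
        window.append(reading)  # when full, the oldest reading falls off
    return list(window)


def spread_out(readings, node_count):
    """Spread readings round-robin over `node_count` nodes; nothing is dropped."""
    nodes = [[] for _ in range(node_count)]
    for i, reading in enumerate(readings):
        nodes[i % node_count].append(reading)
    return nodes


if __name__ == "__main__":
    readings = list(range(1, 8))  # seven made-up readings: 1..7
    print(rolling_window(readings))  # only the newest three survive
    print(spread_out(readings, 3))   # all seven survive, split over three nodes
```

With a single bounded store, history is lost as soon as capacity runs out. With the data spread over several nodes, everything is kept, at the price of having to go to many machines to read it back.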

Wait a minute, does this beast fit in a NoSQL key-value database? Bob is your uncle! It fits somewhere, then. But where? In NoSQL. A loose database with no constraints - where the constraints and the whole database logic are in the hands of the web app developer. I am not defending relational databases here, just arguing about the new NoSQL theory. NoSQL, a database which is not based on the sound and proven mathematics of relational theory, calculus and algebra, a database which is hardly transactional, not consistent (only eventually consistent) and does not enforce integrity. Hmmm.... Whatever, as long as Big Data fits in a NoSQL database and we can capture it, it is fine. Capture now, analyse later. Just store it and spread it out. Then bring in all the metal, CPU and memory in the world (in parallel) to crunch it. What a great idea!

I am persuaded by Big Data. Not all data needs to be stored relationally, and ACID is a theory of transactions, not a must-have database property.

What bugs me is this: now that we are taking away all constraints and logic from database data and allowing it to grow into Big Data, when we look back at it, how quickly will we be able to relate other data to it, and how easily will it make sense? I am just looking forward to any sort of easy-to-use prompt (like Pig and Hive) where I can write something like:

 Select * from BIGDATA;

but in the cloud, as I will never be able to afford the metal to host Big Data anyway. Should be fun.