Saturday, September 7, 2013

Big Data - Background from a Time Traveler

Count me in as the biggest cheerleader on the Big Data bandwagon. I agree, it is very ironic if you know my background. Some will call me a traitor as well, especially those who are more loyal to Oracle and other flavors of RDBMS than Oracle Corp itself is.

For years, even during my days in the Navy, I was a huge Oracle protagonist. Oracle DBAs could be singled out in a crowd - the guys who wore their attitude on their sleeves and a ponytail to back it up. Oracle and Uncle Larry could do nothing wrong. Those were the heady days when Oracle was leading head and shoulders above the rest of the pack.

To top it all, around 2005 or so, Oracle 10g made its debut and Oracle RAC (Real Application Clusters) grew beyond 9i's infamous oracm cluster software into the much more robust CRS cluster software. Everyone was high on Oracle. Oracle could do everything - to the extent that anything that could not be done by Oracle should not even be tried in the first place. Why? Because it simply didn't make sense. Oracle was the pied piper, and most of us in the database fraternity followed it blindly. Oracle could do no wrong. In historical terms, "the sun would never set on the Oracle empire".

However, we soon realized that Oracle was fallible after all. We had a 130 TB data warehouse on Oracle, and in the year of our Lord 2006 we suddenly started seeing SLA misses and performance issues. ETL jobs would take enormously long, and then reporting jobs would run endlessly. For more than six months, many ace DBAs, network administrators, storage gurus, sysadmins and engineering architects broke their heads over it - many a time on the verge of killing each other - but every time we brought the system back to a good state, the data volume would push the envelope and we would see the SLA misses again. It was as if our destiny were taunting and teasing us. The cycle would just play itself out again - encore!

We kept hoping the Oracle development team would add more punch to the RDBMS software. But alas, Oracle did almost nothing - they would not even take a look at DBMS_SCHEDULER (Oracle's native job scheduler). The scheduler fell woefully short of the special needs of data warehouses: no way to allocate resources, prioritize SLA jobs over non-SLA jobs, re-nice a job's priority, and so on.
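To make that complaint concrete, here is a minimal sketch of the behavior we were missing - SLA jobs always dispatched ahead of non-SLA jobs, with a way to re-nice a job after submission. This is purely illustrative Python, not Oracle's API; the job names and priority scheme are invented for the example.

```python
# A toy SLA-aware job scheduler: illustrates the features we wished
# DBMS_SCHEDULER had, not how any real scheduler is implemented.
import heapq
import itertools

class WarehouseScheduler:
    def __init__(self):
        self._heap = []                  # entries: (priority, seq, job_name)
        self._seq = itertools.count()    # tie-breaker keeps FIFO order

    def submit(self, job_name, sla=False, nice=0):
        # SLA jobs sort strictly ahead of non-SLA jobs.
        priority = nice if sla else nice + 100
        heapq.heappush(self._heap, (priority, next(self._seq), job_name))

    def renice(self, job_name, new_priority):
        # Rebuild the queue with the job's new priority.
        kept = [e for e in self._heap if e[2] != job_name]
        kept.append((new_priority, next(self._seq), job_name))
        self._heap = kept
        heapq.heapify(self._heap)

    def next_job(self):
        return heapq.heappop(self._heap)[2] if self._heap else None

sched = WarehouseScheduler()
sched.submit("nightly_report")            # non-SLA job
sched.submit("etl_load_sales", sla=True)  # SLA job jumps the queue
print(sched.next_job())                   # -> etl_load_sales
```

Trivial as it looks, even this much - a priority lane for SLA work plus the ability to bump a waiting job - was exactly what we could not get out of the native scheduler at the time.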

It almost felt as if we had been handed under-performing weaponry and left to fend for ourselves at the mercy of a far superior enemy: the Big Data explosion.

While we were busy with challenges of our own and stuck in a very reactive mode, the world was changing around us. Jeff Dean, Sanjay Ghemawat et al. decided to play spoilsport and worked out MapReduce and Bigtable, the distributed computing and storage frameworks behind Google's search indexing problem.

Since Yahoo was also in the business of search (this well predates the search-deal-with-Microsoft days), Yahoo decided to look around and find its own man Friday. Doug Cutting turned out to be "da man". He was working on Nutch and trying to implement his own versions of Google's distributed storage and processing ideas, perhaps along the same design lines.

Doug Cutting, Mike Cafarella (then an intern, now a very famous professor at the University of Michigan, Ann Arbor) and team were, meanwhile, working in the background on this alternative, which would almost immediately prove a panacea for all our problems. It took them some time to get their work out of the door, but when they did, Hadoop as we know it was born. Named after a toy elephant belonging to Doug's son, Hadoop was, au contraire, no toy - it was a giant beast that easily bested Oracle even while it was barely born.

We created our Hadoop clusters at Yahoo - first as research and development grids, then as production grids. The developers within Yahoo reacted to Hadoop's emergence as if the shepherds had heard the news of Jesus' birth from the angels. Almost overnight (slight exaggeration), we saw most of the event-level aggregation jobs move to the Hadoop grid. The baby elephant easily beat the hell out of the seasoned Oracle software. Jobs that would take 9 hours on a 10-node Oracle data warehouse started completing in less than an hour. The developers had tasted first blood; there was absolutely no turning back.
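For the curious, the event-level aggregation jobs of that era boiled down to a mapper and a reducer. The sketch below is illustrative only - the tab-separated log format and field positions are invented, and the little driver at the bottom simulates Hadoop's shuffle/sort locally so the example runs as-is.

```python
# A toy version of an event-level aggregation job: count events by type.
# mapper() and reducer() mirror what we ran via Hadoop Streaming; the
# driver simulates the shuffle/sort phase that Hadoop performs at scale.
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Emit (event_type, 1) per log line: timestamp <TAB> event_type <TAB> ...
    fields = line.rstrip("\n").split("\t")
    if len(fields) >= 2:
        yield (fields[1], 1)

def reducer(key, values):
    # Sum the counts for one event type.
    yield (key, sum(values))

if __name__ == "__main__":
    log_lines = [
        "2013-09-07T10:00:01\tpage_view\t/home",
        "2013-09-07T10:00:02\tad_click\t/home",
        "2013-09-07T10:00:03\tpage_view\t/search",
    ]
    # Map, shuffle/sort by key, then reduce.
    mapped = sorted(kv for line in log_lines for kv in mapper(line))
    for key, group in groupby(mapped, key=itemgetter(0)):
        for event_type, count in reducer(key, (v for _, v in group)):
            print("%s\t%d" % (event_type, count))
```

On the real grid the same logic ran in parallel across hundreds of nodes, which is precisely how a nine-hour warehouse job could finish in under an hour: nothing got smarter, the work simply got spread out.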

At first we could neither see nor fully understand what Hadoop was doing for us and for the rest of the world. It had ripped off the ceilings, limits and shackles that relational databases had imposed on us for so long. This freedom was most welcome in the analytics world, whose people were suddenly emancipated. They were stunned by the insights they could get from Hadoop-based processing and data. The age of Big Data was upon us - although it would be some time before the term "Big Data" came into being.

The analytics fraternity suddenly realized that there was nothing to restrict their imagination except the limitations of their own competence and the flaws of their minds. The world of analytics and warehouses would never be the same again.

What Jeff Dean, Sanjay Ghemawat, Doug Cutting and the many others who followed them did - whether knowingly and deliberately, or unbeknownst to themselves - was unleash upon us the age of Big Data. I would certainly say that these are the people who should be called the fathers of Big Data.

Meanwhile, we, the loyal Oracle DBAs, were generally in a phase of denial or ignorance. What is this phase? It is believed that any new technology elicits at least two different reactions from its target user base. First, there are those who are new to both the existing (aka older) technology and the new technology or product. For them, new is better - they take to the new technology immediately, as a fish takes to water.

The second user base comprises those deeply entrenched in the older technology or product. For them, learning something new means giving up the advantage they hold over the newbies and starting afresh, so there is reluctance. These people, heavily invested in the existing technology, tend to hate the new one - they would much rather the new technology just went away and vanished. Perhaps they go to sleep wishing for exactly that every single night. They are likely to pass through the following phases:

Phase I - Ignore the existence of the new technology.
Phase II - Lukewarm acknowledgement.
Phase III - Continue to push the old technology with more ad/marketing dollars.
Phase IV - Create FUD (fear, uncertainty, doubt) around the new technology.
Phase V - Run behind the charging caravan and see if they can catch up.


[Image: Phases of Learning]

The believers in the old technologies actually end up providing use cases and a kind of quality-assurance testing for the new technologies. How? They keep pointing out flaws in the new products, and the developers of the new technology continually resolve those issues.

There were a few enabling things happening at the same time that these Big Data containers were being invented. Smartphones like the Palm Treo and BlackBerry, followed by iPhone and Android devices, were showing up. These smartphones and devices had started pushing the limits on the volume of data being generated, and in turn the backend data containers started bursting at the seams. So to a time traveler, now blessed with the power and wisdom of hindsight, it is clear that right at the moment these Hadoop processing and data containers were being invented, people were already waiting with tons and tons of data to pour in. In a somewhat naive sense, Hadoop and its cousins - Bigtable, DynamoDB, the NoSQL systems - were put into production before they could even enter a beta phase.

Just so that I keep this discussion honest, I want to assure you that I am not moving the timelines even one bit. Around 2009, I saw the "light on the mountain" and got "converted" - we realized the power of these new technologies and what they were doing to the world of analytics and insights. Meanwhile, many in the world still perceived Big Data as hype.

I suspect the dust settled around 2010 or so. People realized and surrendered to the real power of Big Data. It was also realized that this was not hype like "Y2K" - it was here to stay. Companies have not looked back in the race to push data into these new data stores since.

What are the different things being dumped into the Big Data containers? Let us take a look at a small sliver of the total data generated online every minute (figures as of 2012, data courtesy of DOMO.com - http://www.domo.com/blog/2012/06/how-much-data-is-created-every-minute; a quick extrapolation of a few of these figures follows the list):

  • Email users send more than 204 million messages
  • Mobile Web receives 217 new users
  • Google receives over 2 million search queries
  • YouTube users upload 48 hours of new video
  • Facebook users share 684,000 bits of content
  • Twitter users send more than 100,000 tweets
  • Apple receives around 47,000 application downloads
  • Brands receive more than 34,000 Facebook 'likes'
  • Tumblr blog owners publish 27,000 new posts
  • Instagram users share 3,600 new photos
  • Flickr users add 3,125 new photos
  • Foursquare users perform 2,000 check-ins
  • WordPress users publish close to 350 new blog posts
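To get a feel for the scale, it is worth extrapolating a few of these per-minute figures to a day. The short sketch below does the arithmetic - back-of-the-envelope only, using the DOMO numbers quoted above.

```python
# Back-of-the-envelope: scale a few of the per-minute figures to a day.
MINUTES_PER_DAY = 60 * 24  # 1,440

per_minute = {
    "emails sent": 204_000_000,
    "Google searches": 2_000_000,
    "tweets": 100_000,
    "YouTube video hours uploaded": 48,
}

for activity, rate in per_minute.items():
    print("{}: {:,} per day".format(activity, rate * MINUTES_PER_DAY))
```

That works out to roughly 294 billion emails and close to 3 billion Google searches every single day - and this is just the sliver listed above, before a single byte of machine-generated log data is counted.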

We recognize all of these companies as big data players - they are "spewing" all this data into some storage, somewhere, so they can make better sense of how people use their products, refine those products, make them more attractive to users, in turn accept even more data from more users, and continue on the kaizen path of continuous improvement. Of course, this improvement comes with the underlying vested interest of finding some way to better monetize the products - if not today, then at least some day in the future.

This data generation is leading to "data pollution". Why call it data pollution? In the absence of great tools to store, process and make sense of this data, we are essentially running a trucking operation: loads of data shipped from one system to another, one datacenter to another, one container to another. Everyone involved in this transportation "value chain" hopes that someone else knows how to extract value out of the data. There is also a faint hope that "somebody" knows extremely well what is going on.

In short, we are clueless, but we hope somebody out there knows how to make sense of it all.

There is virtually a "data race" on. Data size is a good bragging point for storage folks, database administrators and the data community at large. But without good, meaningful analytics hovering over those big volumes, big data is really of no use to man or beast. In fact, I would call it a terribly expensive "garbage dump" if you are storing huge amounts of data but not getting meaningful and actionable insights from it.

I will cover the Big Data enabling technologies - the "partners in this crime" - in my next blog post.
