Count me in as the biggest cheerleader on the Big Data bandwagon. I agree, it is quite ironic if you know my background. Some will call me a traitor as well, especially those who are more loyal to Oracle and other flavors of RDBMS than Oracle Corp itself is. For years, even during my days in the Navy, I was a huge Oracle proponent. Oracle DBAs could be singled out in a crowd - the guys who wore their attitude on their sleeves and a ponytail to back it up. Oracle and Uncle Larry could do nothing wrong. Those were the heady days when Oracle was head and shoulders ahead of the rest of the pack.
To top it all, around 2005 or so, Oracle 10g made its debut and Oracle RAC (Real Application Clusters) grew beyond 9i's infamous oracm cluster software into the much more robust CRS cluster software. Everyone was high on Oracle. Oracle could do everything - to the extent that anything that could not be done by Oracle should not even be tried in the first place. Why? Because it simply didn't make sense. Oracle was the pied piper, and most of us in the database fraternity followed it blindly. Oracle could do no wrong. In historical terms, "the sun would never set on the Oracle empire."
However, we soon realized that Oracle was fallible after all. We had a 130 TB data warehouse on Oracle, and in the year of our Lord 2006 we suddenly started seeing SLA misses and performance issues. ETL jobs would take enormously long, and then reporting jobs would run endlessly. For more than six months, many ace DBAs, network administrators, storage gurus, sysadmins, and engineering architects racked their brains - many a time on the verge of killing each other - but every time we brought the system back to a good state, the data volume would push the envelope and we would see the SLA misses again. It was as if our destiny was taunting and teasing us. The cycle would just play itself out again - encore!
We were hoping the Oracle development team would add more punch to the RDBMS software. But alas, Oracle did almost nothing - they would not even take a look at DBMS_SCHEDULER (Oracle's native job scheduler). The scheduler fell woefully short of the special needs of data warehouses: no way to allocate resources, prioritize SLA jobs over non-SLA jobs, re-nice job priority, and so on.
It almost felt as if we were left with under-performing weaponry, fending for ourselves at the mercy of a far superior enemy: the Big Data explosion.
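To make concrete what we were asking for, here is a minimal sketch in Python - entirely hypothetical, and nothing to do with Oracle's actual API - of the kind of SLA-aware queue we had in mind: SLA-bound jobs always run ahead of ad-hoc work, and a waiting job's priority can be re-niced.

import heapq
import itertools

# A toy sketch, not Oracle's API: SLA-bound jobs always outrank ad-hoc
# work, and a queued job's priority can be "re-niced" while it waits.
class SchedulerSketch:
    def __init__(self):
        self._heap = []      # entries: [sla_rank, priority, seq, name]
        self._entries = {}   # name -> live heap entry, for re-nicing
        self._seq = itertools.count()  # tie-breaker for stable ordering

    def submit(self, name, priority, sla_bound=False):
        # Rank 0 (SLA) sorts ahead of rank 1 (ad-hoc), whatever the priority.
        entry = [0 if sla_bound else 1, priority, next(self._seq), name]
        self._entries[name] = entry
        heapq.heappush(self._heap, entry)

    def renice(self, name, new_priority):
        # Tombstone the old entry and re-queue at the new priority.
        entry = self._entries.pop(name)
        entry[3] = None
        self.submit(name, new_priority, sla_bound=(entry[0] == 0))

    def next_job(self):
        while self._heap:
            rank, priority, seq, name = heapq.heappop(self._heap)
            if name is not None:       # skip tombstoned entries
                del self._entries[name]
                return name
        return None

sched = SchedulerSketch()
sched.submit("adhoc_report", priority=5)
sched.submit("nightly_etl", priority=5, sla_bound=True)
sched.renice("adhoc_report", new_priority=1)
print(sched.next_job())   # nightly_etl: SLA work still runs first

Even after the ad-hoc report is re-niced to a better priority, the SLA-bound ETL job still runs first - exactly the kind of behavior we could not get out of the native scheduler.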
While we were busy with challenges of our own and stuck in a very reactive mode, the world was changing around us. Jeff Dean, Sanjay Ghemawat, et al. decided to play spoilsport and worked out MapReduce and Bigtable, the distributed computing and storage frameworks behind Google's search indexing problem.
Since Yahoo was also in the search business (this well predates the search-deal-with-Microsoft days), Yahoo decided to look around and find its own man Friday. Doug Cutting turned out to be "da man." He was working on Nutch and was trying to implement his own version of what Google had described.
Doug Cutting, Mike Cafarella (then an intern, now a very famous professor at the University of Michigan, Ann Arbor) and team were, meanwhile, working in the background on the alternative that would almost immediately prove a panacea for all our problems.
It took them some time to get their work out of the door, and then Hadoop, as we know it, was born. Named after a toy elephant belonging to Doug's son, Hadoop was, au contraire, no toy - it was a giant beast that easily bested Oracle even while it was barely born.
We created our Hadoop clusters at Yahoo - first as research and development grids, then as production grids. Developers within Yahoo reacted to Hadoop's emergence the way the shepherds reacted to the angels' news of Jesus' birth. Almost overnight (a slight exaggeration), we saw most of the event-level aggregation jobs move to the Hadoop grid. The baby elephant easily beat the hell out of the seasoned Oracle software. Jobs that would take 9 hours on a 10-node Oracle data warehouse started completing in less than an hour. The developers had tasted first blood; there was absolutely no turning back.
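For readers who never saw one of those early jobs, here is a minimal sketch of what an event-level aggregation looked like in the Hadoop Streaming style. The tab-separated input layout and the field position are assumptions made for illustration, not our actual schema.

# mapper.py - emits (event_type, 1) for each input record
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if len(fields) >= 2:
        event_type = fields[1]   # assume column 2 holds the event type
        print("%s\t1" % event_type)

# reducer.py - sums the counts per event type; Hadoop Streaming hands
# the reducer its input sorted by key, so a running total is enough
import sys

current_key, total = None, 0
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t")
    if key != current_key:
        if current_key is not None:
            print("%s\t%d" % (current_key, total))
        current_key, total = key, 0
    total += int(value)
if current_key is not None:
    print("%s\t%d" % (current_key, total))

A job like this would be launched with the streaming jar, along the lines of: hadoop jar hadoop-streaming.jar -input events -output counts -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py (the paths here are placeholders).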
At first we could neither see nor understand what Hadoop was doing for us and for the rest of the world. It had ripped off the ceilings, limits, and shackles that relational databases had imposed on us for so long. This freedom was most welcome in the analytics world, whose people were suddenly emancipated. They were stunned by the insights they could get from Hadoop-based processing and data. The age of Big Data was upon us - although it would be some time before the term "Big Data" came into being.
The analytics fraternity suddenly realized that there was nothing to restrict their imagination except the limitations of their own competence and the flaws of their minds. The world of analytics and warehouses would never be the same again.
Whether knowingly and deliberately or unbeknownst to themselves, Jeff Dean, Sanjay Ghemawat, Doug Cutting, and the many others who followed them had unleashed upon us the age of Big Data. I would certainly say that these are the people who should be called the fathers of Big Data.
Meanwhile, we, the loyal Oracle DBAs, were generally in the denial-or-ignorance phase. What is this phase? Any new technology tends to get at least two different reactions from its target user base. There are those who are new to both the existing (aka older) technology and the new technology or product. For them, newer is better - they take to the new technology as a fish takes to water.
The second user base is those who are deeply entrenched in the older technology or product. For them, learning something new means giving up the advantage they hold over the newbies and starting afresh, so there is reluctance. Now let us take the view of those heavily invested in the existing or older technology: these people hate the new technology - they would much rather the new technology just went away and vanished. Perhaps they go to sleep wishing that every single night. They are likely to pass through the following phases:
Phase I - Ignore the existence of the new technology.
Phase II - Lukewarm acknowledgement.
Phase III - Continue to push the old technology with more ad/marketing dollars.
Phase IV - Create FUD (fear, uncertainty, doubt) around the new technology.
Phase V - Run behind the charging caravan and see if they can catch up.
The believers in the old technologies actually end up providing use cases and a kind of quality-assurance testing for the new technologies. How? They keep pointing out flaws in the new products, and the developers of the new technology keep resolving those issues.
A few enabling things were happening at the same time the Big Data containers were being invented. Smartphones like the Palm Treo and BlackBerry, followed by iPhone and Android devices, were showing up, and they had started pushing the limits of the data volume being generated. In turn, the backend data containers started bursting at the seams. To a time traveler blessed with the power and wisdom of hindsight, it would be clear that right at the moment these Hadoop processing and data containers were being invented, people were already waiting with tons and tons of data to pour in. So, in a somewhat naive sense, Hadoop and its cousins - Bigtable, DynamoDB, the NoSQL systems - were put into production before they could even enter a beta phase.
Just to keep this discussion honest, I want to assure you that I am not moving the timelines one bit. Around 2009, I saw the "light on the mountain" and got "converted" - we realized the power of these new technologies and what they were doing to the world of analytics and insights. Meanwhile, many in the world still perceived Big Data as hype.
I suspect the dust settled around 2010 or so. People realized and surrendered to the real power of Big Data. It was also realized that this was not hype like "Y2K" - it was here to stay. Companies have not looked back in the race to push data into these new data stores since then.
What are the different things being dumped into the Big Data containers? Let us take a look at a small sliver of the total data generated online, every minute (figures as of 2012, data courtesy of DOMO.com - http://www.domo.com/blog/2012/06/how-much-data-is-created-every-minute):
- Email users send more than 204 million messages
- Mobile Web receives 217 new users
- Google receives over 2 million search queries
- YouTube users upload 48 hours of new video
- Facebook users share 684,000 bits of content
- Twitter users send more than 100,000 tweets
- Apple receives around 47,000 application downloads
- Brands receive more than 34,000 Facebook 'likes'
- Tumblr blog owners publish 27,000 new posts
- Instagram users share 3,600 new photos
- Flickr users, on the other hand, add 3,125 new photos
- Foursquare users perform 2,000 check-ins
- WordPress users publish close to 350 new blog posts
We recognize all these companies as Big Data players - they are "spewing" all this data into some storage, somewhere, so they can make better sense of how people are using their products, refine those products, make them more attractive to users, in turn accept more data from more users, and continue on the kaizen path of continuous improvement. Of course, this improvement comes with the underlying vested interest of finding some way to better monetize the products - if not today, then at least some day in the future.
This data generation is leading to "data pollution." Why call it data pollution? In the absence of great tools to store, process, and make sense of this data, we are essentially running a convoy of truckloads of data shipped from one system to another, one datacenter to another, one container to another. Everyone involved in this transportation "value chain" hopes that someone else knows how to extract value out of this data. There is also a faint hope that "somebody" knows extremely well what is going on.
In short, we are clueless but hope somebody else knows how to make sense of it all.
There is virtually a "data race" on. Data size is a good bragging point for storage folks, database administrators, and the data community at large. But without good, meaningful analytics hovering over those big data volumes, Big Data is really of no use to man or beast. In fact, I would call it a terribly expensive "garbage dump" if you are storing huge amounts of data but not getting meaningful, actionable insights from it.
I will cover the Big Data enabling technologies that are the "partners in this crime" in my next blog.