Tuesday, 7 July 2015

Hadoop is the New Black

It feels like any SAS-related project in 2015 not using Hadoop is simply not ambitious enough. The key question seems to be "how big should our Hadoop cluster be?" rather than "do we need a Hadoop cluster?".

Of course, I'm exaggerating; not every project needs to use Hadoop. But there is an element of new thinking required when you consider what data sources are available to your next project and what value they would add to your end goal. Internal and external data sources are easier to acquire, and volume is less and less of an issue (or, stated another way, you can realistically aim to acquire larger and larger data sources if they will add value to your enterprise).

Whilst SAS is busy moving clients from PC to web, there's also a lot of work being done by SAS to move the capabilities of the SAS server inside Hadoop. The aim is to minimise "data miles" by moving the code to the data rather than moving the data to the code. It surely won't be long before we see SAS Grid and LASR running inside Hadoop. It's almost as if Hadoop has become a new operating system on which all of our server-side capabilities must be available.
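To make the "data miles" idea concrete, here's a minimal sketch, assuming SAS/ACCESS Interface to Hadoop is licensed and a HiveServer2 instance is reachable; the server name, credentials and table are hypothetical. With implicit pass-through, PROC SQL hands the aggregation to Hive, so only the summarised rows travel back to the SAS session rather than the whole table.

  /* A sketch of "moving the code to the data", assuming SAS/ACCESS
     Interface to Hadoop; connection details and names are hypothetical */
  libname hdp hadoop server="hadoop-master" port=10000
          user=sasdemo password=mypwd schema=sales;

  /* Where it can, SAS pushes this query down to Hive as HiveQL, so the
     summarisation runs inside the cluster and only the aggregated
     result set is returned to SAS */
  proc sql;
    create table work.revenue_by_region as
    select region, sum(revenue) as total_revenue
    from hdp.transactions
    group by region;
  quit;

The same principle underpins the in-database and in-Hadoop products: the less data that has to cross the network to the SAS server, the better.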

We tend to think of Hadoop as a central destination for data, but it doesn't always start out that way. Hadoop may enter an organisation for a specific use case, but data attracts data, and so once in the door Hadoop tends to become a centre of gravity. This effect is caused in no small part by the appeal of big data being about not just the data size but the agility it brings to an organisation.

SAS's Senior Director of the EMEA and AP Analytical Platform Centre of Excellence, Mark Torr (that's one heck of a title, Mark!), recently wrote a well-founded article on the four levels of Hadoop adoption maturity, based upon his experiences with many SAS customers. His experiences chime with my own, far more limited, observations. Mark lists the four levels as:
  1. Monitoring - enterprises that don't yet see a use for Hadoop within their organisation, or are focused on other priorities
  2. Investigating - those at this level have no clear, focused use for Hadoop yet, but they are open to the idea that it could bring value, and hence they are experimenting to see where and how it can deliver benefits
  3. Implementing - the first one or two Hadoop projects are the riskiest because there's little or no in-house experience, and maybe even some negative political undercurrents too. As Mark notes, the exit from Investigating into Implementing often marks the point where enterprises choose to move from the Apache distribution to a commercial distribution, such as Hortonworks, Cloudera or MapR, that offers more industrial-strength capabilities
  4. Established - at this level, Hadoop has become a strategic architectural tool for organisations and, given the relative immaturity of Hadoop, the organisations are working with their vendors to influence development towards full production-strength capabilities

Hadoop is (or will be) a journey for all of us. Many organisations are just starting to kick the tyres. Of those who are using Hadoop, most are in the early stages of this process, at level 2, with a few front-runners living at level 3. Those organisations at level 3 are typically big enough to face, and invest in solutions to, the challenges that the vendors haven't yet stepped up to, such as managing provenance, data discovery and fine-grained security.

Does anybody live the dream fully yet? Arguably, yes: the internal infrastructures developed at Google and Facebook certainly provide their developers with the advantages and agility of the data lake dream. For most of us, though, we must be content to continue our journey...