Cloudera Applies Open Source Philosophy to Big Data Analytics

These days it’s hard to read about Big Data or the business intelligence segment without seeing the name “Hadoop” come up, and with good reason. Hadoop is a software framework geared towards building applications that can process data volumes well into the petabytes. As a project of the non-profit Apache Software Foundation, the framework is open source and has been integrated into several commercial products designed for Big Data analytics. One of the leaders in this niche is Cloudera, a company that takes the Hadoop credo to heart by providing Big Data analytics solutions that are 100% open source.

It’s not surprising that Cloudera, whose business is a commercial redistribution of Hadoop and support for Hadoop products, would embrace open source so fully. The company aligns itself closely with Apache’s Hadoop project team. In fact, the founders of the original Hadoop project, Doug Cutting and Mike Cafarella, are currently part of the Cloudera staff as architects and advisers.

Cloudera’s roots don’t trace back to Apache alone, either. When the company got its start in 2008, it did so with a pool of engineering talent drawn from former employees of several Silicon Valley players, including Facebook and Google. Cloudera’s ties with Google aren’t just in talent, but in technology as well. Hadoop itself may be an Apache project, but its origins stem from a concept drafted by the now-defunct Google Labs.

Sharing the Big Data Workload

MapReduce began its life as an academic paper published by Google in 2004. In the paper, Google employees Jeffrey Dean and Sanjay Ghemawat describe their concept for a model of big data processing.

In the MapReduce model, when a problem needs to be solved, the input data is handed over to a network of computers that act as nodes in a cluster. A master node sorts the input into subsets of similar data and distributes them down to its subordinate nodes. Those subsets can in turn be broken down into smaller units and redistributed to more nodes. This is the “Map” part of the process.

In the “Reduce” part, the answers computed for those smaller units of data are handed back up the line to the master node, where they’re combined and synthesized into an answer to the original problem.
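The two phases can be illustrated with a small word-counting sketch. This is a single-process toy written for illustration, not Hadoop's actual API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample input are hypothetical, and each input string simply stands in for the subset of data that a master node would hand to one worker.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(chunk: str) -> List[Tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one chunk of input."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group the intermediate values by key so each key is reduced once."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce step: combine all partial counts for one word into a total."""
    return key, sum(values)


def word_count(chunks: List[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence over a list of input chunks."""
    intermediate: List[Tuple[str, int]] = []
    for chunk in chunks:  # each chunk stands in for the work given to one node
        intermediate.extend(map_phase(chunk))
    grouped = shuffle(intermediate)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["big data big", "data analytics", "big analytics"]
    print(word_count(sample))  # {'big': 3, 'data': 2, 'analytics': 2}
```

In a real cluster the map calls would run in parallel on separate machines and the grouping step would move data across the network; the sketch only shows how partial results are produced and then combined.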

Dean and Ghemawat’s paper became the inspiration for Cutting, at the time working for Yahoo, to begin work on an open source implementation of the concept. This eventually became the Hadoop framework that Cloudera distributes and supports today.

Open Source and Big Data

At first glance it might seem self-defeating for a company to base its entire business on a product whose open nature has allowed it to be picked up and redistributed by high-profile competitors like IBM and Yahoo. Cloudera, however, aims to capitalize by marketing its Hadoop derivations to new business segments and by positioning itself as a go-to resource for technical support and consulting on Hadoop software.

The Cloudera approach has, at the very least, won over many investors. Technology-focused investment firms like Accel Partners and Meritech Capital, as well as individual investors like Flickr co-founder Caterina Fake and Gideon Yu of Khosla Ventures, have all contributed to a venture funding total of around $76 million for Cloudera.

The road ahead seems bright. In 2009, Thomson Reuters’ Venture Capital Journal named Cloudera the year’s most promising technology startup. Fears that Google might launch a reprisal against Cloudera over intellectual property issues regarding MapReduce were calmed when Google CEO Eric Schmidt gave his blessing to the company in 2008 and said its technology should be “pervasive.” Earlier this year, the company announced a partnership deal with Oracle for collaboration on a new low-cost BI analysis appliance.


Mark Aspillera: Mark is a former member of the Business-Software.com marketing team. He contributed interviews, profiles and analyses on relevant subjects in the business technology field.