Recently, Cloudera paid a visit to HQ and gave a presentation on their industry-standard implementation of Hadoop. I found it to be very insightful, especially given the paucity of details and proliferation of industry buzzwords that currently attach themselves to anything remotely related to “Big Data”.
I made some notes/observations during the presentation and did some follow-up reading afterwards. In case they’re useful, especially to those DBAs out there who might be expected to support Cloudera pretty soon, I typed up the notes and posted them here.
Supporting the System
Like with Oracle’s engineered systems, Cloudera’s ideal support model is for the customer to have a dedicated “data management” group to support the system.
In terms of “traditional” support teams, the system administrators were considered the “best fit” due to their Linux knowledge, though support tends to fall to the DBAs for the most part.
Cloudera provided 84% of all Hadoop training in 2013 and offers a variety of training courses.
Backup and Recovery
Backup and recovery is not viewed in the same way that relational DBAs view it today.
By default, data is copied to three locations on the cluster, mitigating (somewhat) against the loss of a component and avoiding (to an extent) data access issues that impact shared-nothing architectures such as Teradata when individual nodes are not available.
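The default replication factor is a cluster-wide setting in HDFS. As a sketch (values shown are the HDFS defaults; the exact property name is standard Hadoop, though a Cloudera Manager deployment would normally manage this for you), it lives in hdfs-site.xml:

```xml
<!-- hdfs-site.xml: number of copies HDFS keeps of each block.
     3 is the Hadoop default referred to above. -->
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
```

Individual files can also be given a different replication factor at write time, so the setting is a default rather than a hard limit.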
Nothing is expected to be overwritten in Hadoop. If a new “version” of the data is stored, it is stored in its entirety (no “deltas”) and old versions remain on the system until a specific delete operation is performed.
“Trash collection” can be enabled on all or parts of the data, allowing for an “undelete” similar to Oracle’s Flashback technology. This does add a performance overhead and should not be expected to provide a GUARANTEED ability to restore.
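For reference, the trash facility is controlled by a standard Hadoop property in core-site.xml (a sketch; the retention period shown is an example value, not a recommendation):

```xml
<!-- core-site.xml: minutes a deleted file is retained in the user's
     .Trash directory before being purged. 0 (the default) disables
     trash entirely; 1440 keeps deleted files for 24 hours. -->
<property>
  <name>fs.trash.interval</name>
  <value>1440</value>
</property>
```

Note the caveat in the text: deletes that bypass the trash (or purges after the interval expires) are not recoverable, which is why this should not be treated as a guaranteed restore mechanism.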
There may be cases where it is impossible to retrieve lost data.
I’ll let that sink in for a moment, especially for those of us who have been there and done that when OEM fires off one of THOSE alerts or you get one of THOSE calls from the operations team. You know which ones I mean.
Disaster Recovery and High Availability
A Hadoop cluster can span multiple data centers, allowing it to serve Production and Development at the same time, should this be desirable. Making three copies of the data offers some protection against data center issues, assuming that one of the copies is created on a node which is in a different data center.
To allow for isolation of environments and to protect against disasters such as the loss of an entire data center, Cloudera uses a third-party product from WANdisco to replicate specific segments of the data to a secondary site.
Security and Encryption
NoSQL databases do NOT provide the standard security features of an RDBMS, nor are they intended to. Instead, Cloudera uses Linux’s operating system-level security to control access to the data on the Hadoop cluster.
NoSQL databases do not offer advanced RDBMS features such as encryption. However, as a result of Cloudera’s recent deal with Intel, Cloudera’s next major release (expected early 2015) will include chip-level encryption. This will ensure that data in flight and at rest is encrypted in the Hadoop cluster.
Cloudera Product Suite
Cloudera uses the following products to manage and extract data from the Hadoop storage:
• MapReduce – batch processing and the underlying foundation for a lot of “newer” tools
• Impala – analytical SQL, “turning Hadoop into a MPP cluster”
• Solr – search engine
• Spark – machine learning
• Spark Streaming – stream processing
• YARN – workload management
• Sentry – security (role-based authorization)
• Hadoop Distributed File System (HDFS) – filesystem
• HBase – NoSQL database
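MapReduce’s role as the batch-processing foundation can be illustrated with a small, self-contained word-count sketch. This is plain Python rather than the Hadoop API, with made-up sample input, but it shows the model the tools above build on: a map phase emits key/value pairs, a shuffle groups them by key, and a reduce phase aggregates each group.

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in the input.
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all emitted values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each key's values into a final count.
    return {key: sum(values) for key, values in groups.items()}

lines = ["Hadoop stores data", "Hadoop processes data"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts["hadoop"])  # 2
```

In a real cluster, the map and reduce phases run in parallel across the nodes holding the data, and the shuffle happens over the network; the logic, however, is exactly this shape.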
Newer “data wrangling” tools are being released which allow developers and business users to perform “data discovery” directly on the Hadoop storage on a “self-service” basis.
It is crucial to identify potential use cases where implementing a Hadoop “data hub” would bring quantifiable and significant benefits, either through indirect savings – such as delaying the need to procure more expensive systems – or through additional capability.
This requires that the enterprise’s data strategy (including “Big Data”) has been determined and clearly articulates how, where and why data is to be captured, transformed and reported on to benefit the enterprise.
Contrary to industry hype, not all companies will need to add Big Data tools to their environment, though the vast majority of organizations could realize at least some benefit from deploying Hadoop.
Use cases will vary between businesses. One use case is to offload any staging data and related transformation processing from the existing EDW to the Hadoop data hub. This would free up storage space, processing power and memory on the more expensive EDW, thus saving the “expensive” resources for processing which legitimately requires it.
There is a misconception that Hadoop implementations and NoSQL databases are intended as cheaper replacements for existing RDBMS environments. This is not the case (at least, not yet) for a variety of reasons and, indeed, Cloudera do NOT advocate “ripping anything out” and replacing it with a Hadoop solution.
Instead, they consider Hadoop to be a “complement” to high-performance EDWs such as Exadata, acting as a data hub for the enterprise’s source data.
While relational databases are the proven best solution for a lot of data management needs, some types of processing – especially those involving unstructured data – are better handled by NoSQL databases querying unstructured data stores such as Hadoop.
Cloudera are vendor-agnostic when it comes to database management systems. They are fully incorporated into Oracle’s Big Data Appliance and have strategic partnerships with Intel (hardware) and MongoDB (NoSQL).
Multiple IT teams will need to learn new skills (and receive training) to support these new systems. Moving from relational data to “Big Data” requires a paradigm shift in how data and databases are viewed by a number of teams: DBAs will have to learn to (occasionally) let go of traditional structures and accept that business users will have greater access to the data; developers and power users will need to become data scientists and intimately know their data – where it is, what it consists of and how to retrieve it.
As a DBA, merging existing high-performance RDBMS systems and “Big Data” solutions designed for a smaller, more specialized set of tasks into a single, integrated data management infrastructure has the potential to be highly exciting and rewarding work.
If such an environment is designed and implemented correctly, IT teams can create an extreme-performance data management platform that provides significant competitive advantages to the business, allows for increased scalability and maximizes the infrastructure investment.