I get asked frequently why one would not use a Storage Area Network (SAN) or Network Attached Storage (NAS) with Hadoop, so I thought I would take a moment to explain why you don’t need external storage of any type. Hopefully this posting will help you become more comfortable with Hadoop’s native features and optional modules and demonstrate that a SAN or NAS may not be necessary for you.
First of all, let’s look at the types of things you get in an Enterprise-class SAN or NAS as they pertain to what you might want in a Hadoop infrastructure:
- RAID (Redundant Array of Inexpensive Disks) – This gives you the ability to use SAS, SATA or SSD drives to store your data in such a way that should a drive fail, you don’t lose data. This is very valuable, and any organization with any concern for its data should be using RAID if the application it is running and storing data on does not itself tolerate disk or node failure.
- Replication (copying the volumes, data sets, etc., to another storage device) – This one is easy: if your building burns down, is your data safe?
- Snapshotting of data volumes, which lets you mine data while the production data is untouched. Basically…if I want to see what is happening with my data, I don’t necessarily want to hit live volumes, as that could slow down production…so I break a copy off and look at that without impacting performance.
Let me be clear here…there are absolutely times when using an Enterprise-class storage device makes perfect sense. But for Hadoop it is very much unnecessary, and it is these three areas that I am going to hit, as well as some others, that I hope will demonstrate that Hadoop works best with inexpensive, internal storage in JBOD mode. Some of you might say “if you lose a disk in a JBOD configuration, you’re toast…you lose everything”. That may be true of JBOD on its own, but with Hadoop it isn’t. Not only do you get the speed benefit that JBOD gives you, the Hadoop Distributed File System (HDFS) negates the risk: by default, HDFS keeps three copies of each block on different nodes. This is a very robust way to guard against data loss due to a disk failure or node outage, so you can eliminate the need for performance-reducing RAID.
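To see why three copies protect against a single failure, here is a toy Python sketch. This is not HDFS code, and the node and block names are made up; it simply places each block on three distinct nodes and checks that no single node failure loses data:

```python
import random

def place_replicas(blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, loosely
    mimicking HDFS's default 3x replication."""
    return {block: random.sample(nodes, replication) for block in blocks}

def recoverable(placement, failed_node):
    """A block survives a node failure if at least one replica
    lives on a different node."""
    return all(
        any(node != failed_node for node in replicas)
        for replicas in placement.values()
    )

nodes = [f"node{i}" for i in range(1, 6)]    # a hypothetical 5-node cluster
blocks = [f"block{i}" for i in range(10)]
placement = place_replicas(blocks, nodes)

# With three replicas on distinct nodes, losing any single node loses no data.
assert all(recoverable(placement, n) for n in nodes)
```

Real HDFS is smarter than this sketch: its placement policy is also rack-aware, so replicas survive the loss of an entire rack, not just a single node.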
Secondly, replication between clusters can be done in any number of ways…it really depends on your architecture. But the idea is this…you create Hadoop clusters that are either identical or pretty close to each other, then you use something like Flume, Scribe or Chukwa to load the clusters in parallel. Once that is done, the same tools keep the clusters synced up. Of my customers that run Hadoop, Flume seems to be the tool of choice.
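The parallel-loading idea can be sketched in a few lines of Python. This is not Flume itself (Flume builds pipelines out of sources, channels and sinks, and its replicating channel selector implements exactly this fan-out); the sketch just illustrates delivering every incoming event to both clusters so they stay in sync:

```python
def fan_out(event, sinks):
    """Deliver one event to every sink (stand-ins for each cluster's
    ingest path). Losing no events here is what keeps clusters synced."""
    for sink in sinks:
        sink.append(event)

primary, standby = [], []    # hypothetical ingest buffers for two clusters
for event in ["click:1", "click:2", "click:3"]:
    fan_out(event, [primary, standby])

# Both "clusters" now hold identical data.
assert primary == standby
```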
Finally, we need to look at data mining, or data analytics (Big Data). Hadoop has sometimes received a bad rap because it does not do real-time analytics…you have to wait for batch jobs to finish before you can get to that data. For that reason, many people would look at proprietary software packages for that type of capability. With the advent of Storm (make sure to read THIS article and THIS article on Storm) you no longer have that problem: you can ingest and work with data in real time.
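Storm itself runs on the JVM and models a stream as spouts (sources) feeding bolts (transformations). The following toy Python sketch only mimics that model under those assumed names; the point is that results update as each event arrives, rather than after a batch job completes:

```python
from collections import Counter

def spout(lines):
    """Emit a stream of events, one per line (Storm's 'spout' role)."""
    for line in lines:
        yield line

def split_bolt(stream):
    """Split each event into words (a 'bolt' transforming the stream)."""
    for line in stream:
        for word in line.split():
            yield word

def count_bolt(stream):
    """Maintain a running word count as tuples arrive."""
    counts = Counter()
    for word in stream:
        counts[word] += 1
    return counts

events = ["error disk full", "error network", "disk full"]
counts = count_bolt(split_bolt(spout(events)))
assert counts["error"] == 2 and counts["disk"] == 2
```

In a real Storm topology, each spout and bolt runs as its own distributed task and the counts would be queryable while the stream is still flowing.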
I hope this short article helps you understand that while SAN and NAS solutions are exceptionally valuable in the general enterprise, a Hadoop infrastructure not only does not require them but thrives without them. For more information, make sure to visit http://www.dell.com/hadoop.
Until next time, I’ll see you in the Cloud…