This is one of the big questions when you start your first Hadoop project. Hadoop is Hadoop, right? So it should not matter which distribution you use? There is some truth in that, but there are still quite a few differences between the vendors that are worth knowing about. After all, Linux is Linux, right? Debian or Red Hat should not matter? You can jump straight to the quick answer, or carry on reading for more details.
If you want to know more about Hadoop itself, you can check out the official Apache site, or just the Wikipedia page for history and so on.
There are 3 big Hadoop distributions. Apache Hadoop itself, the root of them all, is not a distribution per se: you can download each component individually, but a lot of elbow grease is needed to tie everything together. The 3 main vendors bundle Apache Hadoop with other tools, open source as well as their own proprietary bricks, to create distributions. Those are Cloudera, MapR and Hortonworks. There are other vendors as well, such as Microsoft (HDInsight, cloud only), Pivotal (Pivotal HD) and others I forget, but I concentrate on the big 3 here.
Use MapR if:
- Performance is paramount,
- You are a big company with strong audit requirements,
- You know you will pay a licence for support.
Use Hortonworks if:
- Open source is very important to you,
- You do not want to pay for a licence but still want to do as much as possible (including security, authorisation),
- You already have a data warehouse (Teradata, Oracle, Vertica…) that you plan to carry on using, but which you could offload or which does not allow all the processing you plan to do.
Use Cloudera if:
- You need to be PCI compliant,
- You want as much as possible automated for you, at the potential cost of a licence.
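The quick answer above can be read as a small checklist. Here is a minimal sketch of it in Python, assuming you simply count how many criteria each distribution matches. The `Needs` class, its field names and the scoring are my own illustration, not a tool any vendor provides, and the result is a conversation starter rather than a verdict.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Needs:
    """Hypothetical checklist; the field names are illustrative only."""
    performance_is_paramount: bool = False
    strong_audit_requirements: bool = False
    will_pay_for_licence: bool = False
    open_source_is_essential: bool = False
    keeps_existing_warehouse: bool = False
    needs_pci_compliance: bool = False
    wants_maximum_automation: bool = False


def score_distributions(needs: Needs) -> Dict[str, int]:
    """Count how many quick-answer criteria each distribution matches."""
    return {
        "MapR": sum([
            needs.performance_is_paramount,
            needs.strong_audit_requirements,
            needs.will_pay_for_licence,
        ]),
        "Hortonworks": sum([
            needs.open_source_is_essential,
            not needs.will_pay_for_licence,
            needs.keeps_existing_warehouse,
        ]),
        "Cloudera": sum([
            needs.needs_pci_compliance,
            needs.wants_maximum_automation,
        ]),
    }


if __name__ == "__main__":
    needs = Needs(open_source_is_essential=True, keeps_existing_warehouse=True)
    scores = score_distributions(needs)
    # Print the distributions from best to worst match.
    for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
        print(f"{name}: {score}")
```

All criteria are weighted equally here, which is an assumption. In practice a single hard requirement (PCI compliance, for instance) may outweigh everything else.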
Longer answer and description
A generic comment first. If you already plan to use specific tools or Linux distributions, make sure that they are compatible with the Hadoop distribution you choose. For instance, Tez does not run on Cloudera, Impala would have problems on Hortonworks, and MapR does not support Debian (but does support Ubuntu).
MapR's biggest differentiators are its filesystem and database, which are said to improve overall performance a lot because they are highly optimised and skip the JVM and ext4 layers, while still being compatible with the HDFS and HBase APIs. The filesystem is a real filesystem, not append-only as HDFS is, and can be mounted via NFS, which makes some administration tasks much easier.
MapR strives to support the whole Hadoop ecosystem (for instance Tez, Impala, Spark…) which on paper means that more tools should be supported by MapR than by the other distributions.
MapR is the only one to support volumes, which can give you very strong security and multi-tenancy, as you can control at a very fine grain who can access which volume.
On the bad side, MapR is pretty limited in its free version.
HA, for instance, is only available with a licence. (EDIT: see comment from Anoop Dawar below, failover is now part of M3, the free version.)
As a nice starting point, you can spawn AWS instances configured for MapR, where the cost includes licence and support, without having to commit for a year. AWS instances usually lag about 2 months behind the main MapR release due to extra testing and procedures.
Cloudera is the oldest Hadoop distribution. Their vision is to fully replace the warehouse by creating an Enterprise Data Hub, and to help the user a lot along the way.
The biggest strength of Cloudera is their automation. Cloudera Manager and Navigator are amazing tools that do a lot for you, and are said to be superior to their equivalents in other distributions. That said, they are closed source, and although Manager is available for free, Navigator (security, governance) is not.
Another very strong point of Cloudera is Impala, a very fast open-source in-memory SQL database.
Cloudera is the only PCI-compliant distribution.
Cloudera claims to have more Hadoop (and associated tools) committers on payroll than any other distribution.
Hortonworks' vision is not to fully replace the warehouse, but to use the existing warehouse and provide offloading or new processes alongside it, thanks to integration with multiple partners.
Hortonworks is a fully open source distribution. There is no licence to pay, only support if you so wish. The definition of open source for Hortonworks is very strict. For them, open source means managed by a committee, to avoid ‘dictatorial’ open source, where a project is technically open source but only one company can accept (and usually refuses) contributions.
Ambari is the management tool for Hortonworks. Although it is quite new and does not yet have all the features you would want from a manager, it is improving at great speed and is supported by multiple organisations, thanks to being open source.
Hortonworks supports Debian, but with an extra 1-month delay due to the extra tests needed in comparison with the standard Red Hat/CentOS version.
Hortonworks claims to have more Hadoop (and associated tools) committers on payroll than any other distribution.
This is always a big question, isn’t it? Here are a few prices I could gather. Those are just ballpark figures, and could of course be negotiated.
MapR support (24/7) is around 4k$/server/year. This goes up to 6k if you want to include MapRDB as well. This includes licence and support.
Cloudera support (24/7) is around 6.5k€/server/year. This includes support and licence. Note that Cloudera has multiple options, where you can elect to have full support (Enterprise), support for only one element (Flex) or support for only the core Hadoop, i.e. HDFS, Hive and the like (Basic). Flex and Enterprise provide Navigator, but Basic is very cheap (500€/server/year).
Hortonworks does not sell a licence as it is fully open source, but support (24/7) is priced at about 3.5k€/server/year.
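To get a feel for what these figures mean for a whole cluster, here is a rough sketch in Python. The prices are only the ballpark numbers quoted above, the offer names and the `annual_support_cost` helper are my own, and the currencies are kept as quoted (dollars for MapR, euros for the others) rather than converted.

```python
# Ballpark 24/7 support prices per server per year, as quoted in this post.
# Currencies are kept as quoted and deliberately not converted.
BALLPARK_PRICES = {
    "MapR": (4000, "USD"),
    "MapR with MapRDB": (6000, "USD"),
    "Cloudera": (6500, "EUR"),
    "Cloudera Basic": (500, "EUR"),
    "Hortonworks": (3500, "EUR"),
}


def annual_support_cost(offer, servers):
    """Return (amount, currency) for one year of support on `servers` nodes."""
    if servers < 0:
        raise ValueError("servers must not be negative")
    price_per_server, currency = BALLPARK_PRICES[offer]
    return price_per_server * servers, currency


if __name__ == "__main__":
    # Compare all offers for a hypothetical 10-node cluster.
    for offer in BALLPARK_PRICES:
        amount, currency = annual_support_cost(offer, 10)
        print(f"{offer}: {amount} {currency} per year")
```

This assumes a flat per-server price with no volume discount, which is unlikely to survive a real negotiation.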
This is usually a big concern, especially when talking about non-open-source tools. I would claim that it is a non-problem.
Your data is always available via standard tools, and that is what matters the most. You will always be able to retrieve or export it in multiple ways. The rest (administration, basically) is tied to your distribution anyway. Whether you do everything with vanilla Apache and Puppet, or use Ambari, Cloudera Manager or the MapR manager, your setup is not transferable to another tool. In short, you are locked in, administration-wise, anyway.
Very nice post. I’m personally not familiar with anything but Hortonworks, and admittedly that is slim. Seems to me like the real differentiator between stacks isn’t support cost (they’re all in the same ballpark) but rather quality of support.
I would love to see someone write up their honest dealings and opinions and compare them.
To add my 2 cents, I’ve twice needed to get Hortonworks on a call. I’ve already admitted I’m a novice, and I just needed some basic configuration guidance and a quick sanity check. In both cases Hortonworks gave me the run-around, trying to get out of providing support because my cluster was non-standard. I felt like I was dealing with someone trying to do CYA and lawyering up, having to double-check the contract first. And at no time was I pushy or behaving in a way that warranted that kind of response. I didn’t appreciate that. When we finally got past that, I felt that the support engineers’ knowledge of Storm and Kafka was just slightly more advanced than mine. That’s worrisome.
I can only compare that with Couchbase. When you call them you are never more than 2 minutes removed from a knowledgeable engineer, gracious, and with a sense of urgency. And I’ve never been asked what my support contract details are until the problem is resolved. I like that.
Thank you for your post and your effort to be unbiased. I’m from MapR, and out of respect for your intent I will refrain from posting about all the other cool features that I feel you missed, or from talking about competitors. I would simply like to point out a few things you called out about MapR where the facts have changed or are different, so that your readers get more information.
Based on customer requests, we added the following HA features to the free version:
– Ability to run HA (active/standby) (http://doc.mapr.com/display/MapR/CLDB+Failover). Motivated users have written their own scripts to make this easier. Additionally, starting with the 4.1 release, the CLDB detection alarm in M3 is issued within a minute of the failure, further reducing any potential downtime
– YARN Resource Manager HA is also available on the M3 Community Edition
Although a license is required to run MapR, the M3 Community Edition is free. And if an Enterprise or Enterprise Database Edition customer doesn’t renew the license, the system simply falls back to M3 Community Edition, and therefore your data is never locked and you can continue to benefit from the features there (https://www.mapr.com/products/mapr-distribution-editions).
Thanks for your correction, I updated the post.