Hortonworks, Cloudera or MapR?

This is one of the big questions when you start your first Hadoop project. Hadoop is Hadoop, right? So it should not matter which distribution you use? There is some truth to that, but there are still quite a few differences between these vendors worth knowing about. After all, Linux is Linux, right? Debian or Redhat should not matter? You can jump straight to the quick answer, or carry on reading for more details.

Generalities

If you want to know more about Hadoop itself, you can check out the official Apache site, or just the wikipedia page for history and so on.

There are 3 big Hadoop distributions. Apache Hadoop itself, the root of them all, is not a distribution per se: you can download each component individually, but a lot of elbow grease is needed to tie everything together. The 3 main vendors bundle Apache Hadoop with other tools, open source as well as their own proprietary bricks, to create distributions. Those are Cloudera, MapR and Hortonworks. There are other vendors as well, such as Microsoft (HDInsight, cloud only) and Pivotal (Pivotal HD), but I concentrate on the big 3 here.

Quick answer

Use MapR if:

  • Performance is paramount,
  • You are a big company with strong audit requirements,
  • You know you will pay a licence for support.

Use Hortonworks if:

  • Open source is very important to you,
  • You do not want to pay for a licence but still want to do as much as possible (including security, authorisation),
  • You already have a datawarehouse (Teradata, Oracle, Vertica…) that you plan to carry on using, but which you could offload or which does not allow all the processing you plan to do.

Use Cloudera if:

  • You need to be PCI compliant,
  • You want as much as possible automated for you, at the potential cost of a licence.

Longer answer and description

A generic comment first: if you already plan to use specific tools or Linux distributions, make sure they are compatible with your chosen distribution and version. For instance Tez does not run on Cloudera, Impala would have problems on Hortonworks, and MapR does not support Debian (but does support Ubuntu).

MapR

MapR's biggest differentiators are its filesystem and database, said to greatly improve overall performance because they are highly optimised and skip the JVM and ext4 layers, while still being compatible with the HDFS and HBase APIs. Their filesystem is a real filesystem, not append-only as HDFS is, and can be mounted via NFS, which makes some administration tasks much easier.

MapR strives to support the whole Hadoop ecosystem (for instance Tez, Impala, Spark…) which on paper means that more tools should be supported by MapR than by the other distributions.

MapR is the only one to support volumes, which can give you very strong security and multi-tenancy, as you can control with very fine granularity who can access which volume.

On the bad side, MapR is pretty limited in its free version. HA, for instance, used to be available only with a licence. (EDIT: see comment from Anoop Dawar below, failover is now part of M3, the free version.)

As a nice starting point, you can spawn AWS instances preconfigured for MapR, where the cost includes licence and support, without having to commit for a year. AWS instances usually lag about 2 months behind the main MapR release due to extra testing and procedures.

Cloudera

Cloudera is the oldest Hadoop distribution. Their vision is to fully replace the data warehouse by creating an Enterprise Data Hub, helping the user a lot along the way.

The biggest strength of Cloudera is their automation. Cloudera Manager and Navigator are amazing tools that do a lot for you, and are said to be superior to their equivalents from other distributions. That said, they are closed source, and although Manager is available for free, Navigator (security, governance) is not.

Another very strong point of Cloudera is Impala, a very fast open-source in-memory SQL query engine.

Cloudera is the only PCI-compliant distribution.

Cloudera claims to have more Hadoop (and associated tools) committers on payroll than any other distribution.

Hortonworks

Hortonworks' vision is not to fully replace the warehouse, but to use existing warehouses for offloading or new processes, thanks to integration with multiple partners.

Hortonworks is a fully open-source distribution. There is no licence to pay, only support if you so wish. Hortonworks' definition of open source is very strict: for them, open source means managed by a committee, as opposed to ‘dictatorial’ open source, where a project is technically open source but only one company can accept (and usually refuses) contributions.

Ambari is the management tool for Hortonworks. Although it is quite new and does not yet have all the features you would want from a manager, it is improving at great speed and is supported by multiple organisations, thanks to being open source.

Hortonworks supports Debian, but with an extra 1-month delay due to extra tests needed in comparison with the standard Redhat/CentOS version.

Hortonworks claims to have more Hadoop (and associated tools) committers on payroll than any other distribution.

Price comparison

This is always a big question, isn’t it? Here are a few prices I could gather. Those are just ballpark figures, and could of course be negotiated.

MapR support (24/7) is around $4k/server/year. This goes up to $6k if you want to include MapRDB as well. Both figures include licence and support.

Cloudera support (24/7) is around €6.5k/server/year, including licence and support. Note that Cloudera has multiple options: full support (Enterprise), support for only one element (Flex), or support for only the core Hadoop, i.e. HDFS, Hive and the like (Basic). Flex and Enterprise provide the Navigator; Basic does not, but is very cheap (€500/server/year).

Hortonworks does not sell licences as it is fully open source, but support (24/7) is priced at about €3.5k/server/year.
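Taking the ballpark figures above at face value, a quick back-of-the-envelope comparison is easy to script. The cluster size and support period below are purely hypothetical, and note that MapR is quoted in dollars while the others are in euros, so a real comparison would convert to a single currency first:

```python
# Ballpark support prices per server per year, from the figures above.
# Caveat: MapR is in dollars, the others in euros.
prices_per_server_year = {
    "MapR": 4000,
    "MapR + MapRDB": 6000,
    "Cloudera Enterprise": 6500,
    "Cloudera Basic": 500,
    "Hortonworks": 3500,
}

servers = 20   # hypothetical cluster size
years = 3      # hypothetical support period

# Total cost of support, cheapest first
for vendor, price in sorted(prices_per_server_year.items(), key=lambda kv: kv[1]):
    print("{0:<20} {1:>10,}".format(vendor, price * servers * years))
```

Even at these rough numbers, the spread between the cheapest option (Cloudera Basic, no Navigator) and the most complete one is more than a factor of 10 over 3 years.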

Vendor lockin

This is usually a big concern, especially when talking about non open-source tools. I would claim that it is a non-problem.

Your data is always available via standard tools, and that is what matters most. You will always be able to retrieve or export it in multiple ways. The rest (administration, basically) is tied to your distribution anyway. Whether you do everything with vanilla Apache and Puppet, or use Ambari, Cloudera Manager or the MapR manager, your setup is not transferable to another tool. In short, you are locked – administration-wise – anyway.


Graph the noise level in your office in 15 minutes

This is a recurrent complaint in any open space: “There is too much noise!” (the other one being that it is too cold or too hot). There are some usual culprits, but it is nice to have data to back your complaints up.

I will show you here how to generate a real-time noise level graph in 15 minutes, without any material besides your laptop or desktop; not even a microphone is needed. This is a dirty hack, but it works and can be put in place very quickly with just a few command lines. The steps, which will be mostly cut & paste, are:

  • install a noise recorder tool
  • set up nginx to serve the data recorded
  • use a nice javascript library to display the data properly

I used soundmeter, a python tool. So first, install it:

# make sure we can install python packages
apt-get install virtualenv
# install required dependencies for building soundmeter
apt-get install python-dev portaudio19-dev alsa-utils
# install other useful tools for later display
apt-get install nginx expect-dev
# set up a directory for the tool
mkdir $HOME/soundmeter
# create virtualenv
virtualenv $HOME/soundmeter
# activate it
source $HOME/soundmeter/bin/activate
# install soundmeter
pip install soundmeter

Et voilà, your recorder is set up.

But do you not need a microphone? Well, either you have a laptop with a built-in microphone, or you can just plug in a headphone, which is basically a microphone used the other way around (producing instead of recording sound).

To get data, a one-liner is enough:

soundmeter --segment 2 --log /dev/stdout 2>/dev/null | unbuffer -p perl -p -e 's{\s*(\d+)\s+(.{19})(.*)}{"$2,". 20*log($1)/log(10)}e' > meter.csv

The explanation is as follows:

  • soundmeter: run soundmeter forever (--seconds limits the duration)
  • --segment 2: output data every 2 seconds (default 0.5 seconds, but is very spiky)
  • --log /dev/stdout: default data on stdout is not useful for graphing, we need to log to a file. Use /dev/stdout as file to actually log to stdout
  • 2>/dev/null: do not pollute output
  • |: the output is not in a great format, it needs to be reformatted
  • unbuffer -p: by default data is buffered, which is annoying for real-time view. This does what the name suggests
  • perl -p -e: yummy, a perl regexp!
  • s///e: this will be a substitution, where the replacement part is a perl expression
  • \s*(\d+)\s+(.{19})(.*): record value and timestamp stripped of the milliseconds
  • "$2,": display the timestamp first, followed by a comma for the csv format
  • 20*log($1)/log(10): the values from soundmeter are RMS amplitudes; transform them to dB via the formula 20 * log10(rms) (Perl's log is the natural logarithm, hence the division)
  • > meter.csv: save data in a file

In short, we do the following transformation on the fly and write it to a csv file:

12 2015-09-22 13:36:13,082 => 2015-09-22 13:36:13,21.5836249
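The RMS-to-dB conversion is easy to sanity-check in Python; note that the dB formula uses a base-10 logarithm, which is what produces the 21.58 value above for an RMS reading of 12:

```python
import math

def rms_to_db(rms):
    """Convert an RMS amplitude to decibels: 20 * log10(rms)."""
    return 20 * math.log10(rms)

# The RMS reading of 12 from the example line above:
print(round(rms_to_db(12), 7))  # -> 21.5836249
```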

You now have a nice csv file. How to display it? Via a nice html page with the help of a javascript library, dygraphs, of course.

Set up nginx by adding the following content in /etc/nginx/sites-enabled/noise (replace YOUR_HOME with your actual home directory, of course):

server {
 listen 80;
 root YOUR_HOME/soundmeter;
}

and restart nginx:

service nginx restart

Then set up your page in $HOME/soundmeter/noise.html:

<html>
<head>
<script src="//cdnjs.cloudflare.com/ajax/libs/dygraph/1.1.1/dygraph-combined.js"></script>

<style>
#graphdiv2 { position: absolute; left: 50px; right: 10px; top: 50px; bottom: 10px; }
</style>

</head>
<body>
<div id="graphdiv2"></div>
<script type="text/javascript">
 g2 = new Dygraph(
 document.getElementById("graphdiv2"),
 "http://localhost/meter.csv", // path to CSV file
 {
 delimiter: ",",
 labels: ["Date", "Noise level"],
 title: "Noise (in dB)",
 showRoller: true,
 }
 );
</script>
</body>
</html>

You can of course replace localhost by your IP to publish this page to your colleagues.

Now just go to http://localhost/noise.html:

[Screenshot: the resulting real-time noise level graph]

Indispensable tool of the day: thefuck

Thefuck is a magnificent app which corrects your previous console command.

Thefuck is a python tool, hosted on github, which looks at your previous command and tries to correct it. You of course invoke it after a typo for instance, by typing fuck in your terminal.

A few examples in image, from the developer himself:

thefuck in action

A few more examples:

Adds sudo for you:

~> apt-get install thefuck 
E: Could not open lock file /var/lib/dpkg/lock - open (13: Permission denied)
E: Unable to lock the administration directory (/var/lib/dpkg/), are you root?
~> fuck
sudo apt-get install thefuck [enter/↑/↓/ctrl+c]
Reading package lists... Done

Corrects your git branch syntax (this is my main use of thefuck):

~/gits/lomignet-apt> git push
fatal: The current branch newfeature has no upstream branch.
To push the current branch and set the remote as upstream, use

  git push --set-upstream origin newfeature

~/gits/lomignet-apt> fuck
git push --set-upstream origin newfeature [enter/↑/↓/ctrl+c]
Total 0 (delta 0), reused 0 (delta 0)
To git@github.com:lomignet/lomignet-apt.git
 * [new branch] newfeature -> newfeature
Branch newfeature set up to track remote branch newfeature from origin.

The installation is trivial as the package is uploaded to the python package index:

pip install thefuck

To have the command available, add this to your .bashrc or whichever startup script you use:

eval $(thefuck --alias)

In the background, thefuck has a set of known rules which you can find in the readme. If those rules are not comprehensive enough for you, you can write your own.
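To give an idea, a custom rule is just a Python file exposing a match and a get_new_command function. The sketch below mimics the git push example from earlier; the command attributes (script, output) are my reading of the readme, so double-check against the current thefuck API before relying on them:

```python
import re

# A thefuck rule: `command.script` is assumed to hold what you typed,
# `command.output` what the shell printed back.

def match(command):
    # Fire when git suggests setting an upstream branch
    return ("git push" in command.script
            and "--set-upstream" in command.output)

def get_new_command(command):
    # Reuse the exact command git suggested in its error message
    return re.search(r"git push --set-upstream \S+ \S+",
                     command.output).group(0)

# Minimal stand-in object to try the rule outside thefuck:
class FakeCommand(object):
    def __init__(self, script, output):
        self.script = script
        self.output = output

cmd = FakeCommand("git push",
                  "To push the current branch and set the remote as upstream, use\n"
                  "\n    git push --set-upstream origin newfeature\n")
print(get_new_command(cmd) if match(cmd) else None)
# -> git push --set-upstream origin newfeature
```

Dropped into ~/.config/thefuck/rules/ (the default rules directory, as far as I know), the file name becomes the rule name.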

Restart Mongodb with not enough nodes in a replica set

The context could be a virtualised cluster where a hypervisor went down suddenly. 2 of your Mongo replicas are unavailable, and only 1 is left, which then of course drops back to being a secondary and is read only.

You want to have this server running alone while the others come back online, as you decide that a potential small inconsistency is better than not running for a few hours. The thing is that this last server will complain that the rest of the set is not available. To get it started again, you just need to make it forget about the rest of the set.

  1. Switch the service off
    service mongodb stop
  2. Remove the replSet line from your /etc/mongodb.conf
  3. Restart the service
    service mongodb start

    Mongo will complain:

    mongod started without --replSet yet 1 documents are present in local.system.replset
     [initandlisten] ** Restart with --replSet unless you are doing maintenance and no other clients are connected.
     [initandlisten] ** The TTL collection monitor will not start because of this.
     [initandlisten] ** For more info see http://dochub.mongodb.org/core/ttlcollections
  4. Remove the offending document in system.replset from the mongo shell
    // will give you one document back
    db.system.replset.find()
    // remove all documents (there is only one)
    db.system.replset.remove({})
    // check resultset is empty
    db.system.replset.find()
    
  5. Restart mongo
    service mongodb stop
    service mongodb start
  6. Once the other nodes are up, add again the replSet line in /etc/mongodb.conf and restart the service.

New puppet apt module, now with better hiera!

The puppet apt module from puppetlabs works great, but has one big issue. You can define sources (and keys, settings and ppas) from hiera, but only your most specific definition will be used by the module, as only the default priority lookup is done. This means a lot of cut & paste if you want to manage apt settings across multiple environments or server roles. This is known, but will not be fixed as it is apparently by design.

Well, this design did not suit me, so I forked the puppetlabs module, updated it to do a proper hiera_hash lookup, and published it to the puppet forge. There is no other difference from the original, but it does simplify my life a lot. Now if you define multiple sources in your hierarchy, for instance at datacenter level:

apt::sources:
  'localRepo':
    location: 'https://repo.example.com/debian'
    release: '%{::lsbdistcodename}'
    repos: 'main contrib non-free'

and at server level:

apt::sources:
  'puppetlabs':
    location: 'http://apt.puppetlabs.com'
    repos: 'main'
    key:
      id: '47B320EB4C7C375AA9DAE1A01054B7A24BD6EC30'
      server: 'pgp.mit.edu'

you will nicely have both sources in sources.list.d, instead of only having the one defined at server level.

You can find the source on github, and can download the module from the puppet forge. Installing it is as simple as:

puppet module install lomignet-apt

Better sudo messages

There is a nice feature to increase the “usefulness” of sudo’s error messages when a user enters a wrong password.

Just add to /etc/sudoers the following line:

Defaults insult

et voilà:

admin@localhost:~$ sudo ls
[sudo] password for admin: 
I can't hear you -- I'm using the scrambler.
[sudo] password for admin: 
Have you considered trying to match wits with a rutabaga?
[sudo] password for admin: 
Wrong! You cheating scum!
[sudo] password for admin: 
You type like I drive.
[sudo] password for admin: 
Your mind just hasn't been the same since the electro-shock, has it?
[sudo] password for admin: 
stty: unknown mode: doofus
[sudo] password for admin: 
Listen, burrito brains, I don't have time to listen to this trash.
[sudo] password for admin: 
You speak an infinite deal of nothing
[sudo] password for admin: 
That's something I cannot allow to happen.
[sudo] password for admin: 
I feel much better now.
[sudo] password for admin: 
He has fallen in the water!
[sudo] password for admin: 
I don't wish to know that.

The only thing to be aware of if you set sudo up that way is that you should expect a big surge in security logs once your users discover it and start playing with it.

Writing your first Sublime Text 3 plugin

Sublime Text is an amazing text editor: sleek, full of features, multi-platform, very usable without a mouse. It has been my editor of choice for a few years already.

One of its great advantages is that it is extensible in python, which makes it very easy to tweak.

I recently played with a vagrant box in which I needed to update a file. The file was mounted inside vagrant, but needed to be copied elsewhere inside the box, meaning it had to be copied manually every time I saved it. As I am very lazy (laziness drives progress, one of the favourite sayings of my director of studies) I wanted to do that automagically. This is a very simple job, ideal for a first plugin.

So, where to start? Well, it is quite easy: on the menu bar click Tools, then New Plugin. Et voilà, you have your first plugin. Congratulations!

import sublime, sublime_plugin


class ExampleCommand(sublime_plugin.TextCommand):
    def run(self, edit):
        self.view.insert(edit, 0, "Hello, World")

This is a nice skeleton, but it does not go very far.

As I want to have an action on save, I needed to have an event listener plugin, in my case listening to a save event. I wanted to act after the event (use the file only after it was written). The API says that on_post_save_async is the best event for me, as it runs in a separate thread and is non blocking.

import sublime
import sublime_plugin


class UpdateOnSave(sublime_plugin.EventListener):

    def on_post_save_async(self, view):
        filename = view.file_name()
        # do something with the filename

Good, this is getting somewhere! All the base is present, now I just had to do something with this file.

The something in that case was a subprocess call, to use vagrant ssh. Sublime already has a wrapper around subprocess, named exec. Exec can run in 2 contexts: view (basically the sublime buffer you are editing) to run TextCommands, or window (sublime itself) to run all types of commands. Finding in which context to run your command is a bit of trial and error, but once done, the last bit of the plugin is a trivial (once you know it) call to exec:

# please always use shlex with subprocess
import shlex
import sublime
import sublime_plugin
import os


class UpdateOnSave(sublime_plugin.EventListener):

    def on_post_save_async(self, view):
        filename = view.file_name()
        savedfile = os.path.basename(filename)
        saveddir = os.path.dirname(filename)

        # write in sublime status buffer
        sublime.status_message('Manually saving ' + filename)

        source_in_vagrant = '/vagrant/' + savedfile
        dest_in_vagrant = '/project/' + savedfile

        cmd_cp = "vagrant ssh -c 'sudo cp {0} {1}'".format(
            source_in_vagrant, dest_in_vagrant)

        view.window().run_command('exec', {
            'cmd': shlex.split(cmd_cp),
            'working_dir': saveddir,
        }
        )

Good, your plugin is ready! The last question is where to put it so it is actually used. With Sublime 3 under Linux, it goes in $HOME/.config/sublime-text-3/Packages/User. Note that it must be named something.py, with the .py extension (not .py3), or it will not be found.

You can find the plugin on github.