About (big) kafka broker id

I had quite a bit of fun setting up the kafka broker id, and these are my findings, hopefully saving some time for other poor souls like me.

TL;DR

Set this up in your kafka config:

  • nothing, to get auto-generated ids
  • broker.id=something_big AND reserved.broker.max.id=something_even_bigger, to set your ids up manually.

Long Story

The broker id is a unique identifier per broker. Each broker in the cluster must have a different id, which is a positive Java int (so at most 2147483647). This is all fine and dandy and works nicely if your ids simply increase from 1, 2…

Another option, nice for automated deployment, is to generate ids based on the IP address, which should be unique within a DC and thus (probably) within a cluster. With puppet, a ruby expression in a template like:

broker.id=<%= @ipaddress.split('.').inject(0) {|total,value| (total << 8 ) + value.to_i} & 0x7FFFFFFF %>

nicely generates a 31-bit int from the 32-bit IP (java has no unsigned int, so we cannot use the full range), discarding only the highest bit to keep as much variability as possible.
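If puppet and ruby are not your thing, the same arithmetic is easy to reproduce elsewhere; here is a rough python sketch of the same idea (the address is of course just an example):

# Sketch: derive a 31-bit broker id from an IPv4 address,
# mirroring the ruby template above.
ip = '10.1.2.3'  # example address

broker_id = 0
for octet in ip.split('.'):
    broker_id = (broker_id << 8) + int(octet)

# Java ints are signed, so drop the highest bit to stay positive.
broker_id &= 0x7FFFFFFF
print(broker_id)  # 167838211 for 10.1.2.3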

Now, it so happens that kafka can generate its ids as well, from a zookeeper sequence. To make sure there is no collision, the auto-generated ids will always be above the undocumented reserved.broker.max.id value, which is 1000 by default.

Conversely, manual ids cannot be above this limit. If you dare set an id above it in your config file, kafka will just not start, and, more annoyingly, will not give you any feedback beyond an exit code of 1. Once you discover this configuration option, the solution is easy: just set it as high as possible, for instance to the maximum int:

reserved.broker.max.id=2147483647

The problem was to find out that it actually was the problem.

On a side note, changing the id after the first kafka start is a very bad idea, and you will end up with a message saying for instance:

kafka.common.InconsistentBrokerIdException: Configured brokerId 999 doesn’t match stored brokerId 838 in meta.properties

Get started with AWS and Python

When you start for the first (or even second) time with AWS, it is a bit tricky to get your head around all the bits and bolts that need to be connected together. If, on top of this, you try to work with AWS in Beijing from outside China, the web GUI makes your work even harder because of slowness or even timeouts.

This script sets up a full set of resources for you (vpc, route table, security group, subnet, internet gateway, instance, with the relevant associations and attachments) for easy testing or bootstrapping of your infrastructure.

It is mostly meant as a testing help, so it does not handle all the possible options, but I find it invaluable to get started. You just need the AWS basics (an account, credentials and a keypair), and it will do the rest for you. You need to provide a tag name (defaults to ‘roles’) and value, and all resources will be created and located via this tag, to allow for easy spawning and tearing down, as sketched below.
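As an illustration of the tag idea (this is not the script itself, just a sketch of how such a lookup can be done with boto3, with placeholder region and values):

# Sketch: find back the instances spawned for a given role,
# assuming everything was tagged with the 'roles' tag as described above.
import boto3

ec2 = boto3.client('ec2', region_name='cn-north-1')  # example region
reservations = ec2.describe_instances(
    Filters=[{'Name': 'tag:roles', 'Values': ['testing']}]
)['Reservations']

for reservation in reservations:
    for instance in reservation['Instances']:
        print(instance['InstanceId'], instance['State']['Name'])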

usage: fullspawn.py3 [-h] [--tag TAG] [--up | --down] [--wet | --dry]
 [--log {DEBUG,INFO,WARNING,ERROR,CRITICAL}] [--cidr CIDR]
 [--ami AMI] [--keypair KEYPAIR] [--profile PROFILE]
 [--instance INSTANCE]
 role

Spawns a full AWS self-contained infrastructure.

positional arguments:
 role Tag value used for marking and fetching resources.

optional arguments:
 -h, --help show this help message and exit
 --tag TAG, -t TAG Tag name used for marking and fetching resources.
 (default: roles)
 --up, -u Creates a full infra. (default: up)
 --down, -d Destroys a full infra. (default: up)
 --wet, -w Actually performs the action. (default: dry)
 --dry Only shows what would be done, not doing anything.
 (default: dry)
 --log {DEBUG,INFO,WARNING,ERROR,CRITICAL}
 Verbosity level. (default: WARNING)
 --cidr CIDR The network range for the VPC, in CIDR notation. For
 example, 10.0.0.0/16 (default: 10.0.42.0/28)
 --ami AMI The AMI id for your instance. (default: ami-33734044)
 --keypair KEYPAIR A keypair aws knows about. (default: yourkey)
 --profile PROFILE Profile to use for credentials. Will use AWS_PROFILE
 environment variable if set. (default: default)
 --instance INSTANCE Instance type. (default: t2.micro)

For instance:

# Let's see what would happen when creating a full infra...
./fullspawn.py3 -t tag --up --dry testing
# Looks good, let's do it.
./fullspawn.py3 -t tag --up --wet testing
# oops, this was a stupid tag name
./fullspawn.py3 -t tag --down --wet testing

You probably want to have a look at some variables inside the script which set a few defaults that might not be relevant for you. I am thinking about the ami (AMI), the keypair (KEYPAIR) and the ingress rules (INGRESS), all defined before the argparse calls.
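For reference, the defaults could look something like the snippet below; the names AMI, KEYPAIR and INGRESS come from the script, but their exact shapes here are my guess, so check the source itself:

# Hypothetical defaults, defined before the argparse calls.
AMI = 'ami-33734044'   # default image, replace with one valid in your region
KEYPAIR = 'yourkey'    # a keypair AWS already knows about
INGRESS = [            # ingress rules for the security group, e.g. ssh only
    {'IpProtocol': 'tcp', 'FromPort': 22, 'ToPort': 22,
     'IpRanges': [{'CidrIp': '0.0.0.0/0'}]},
]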

The code is available on github.

Enjoy!

Vagrant, hostmanager, virtualbox and aws

No need to present Vagrant if you read this post. Hostmanager is a plugin which manages /etc/hosts on the host and the guests, to have common fqdns and simplify connections to your boxes. You can for instance set up hostmanager to make sure that your web development virtual is always accessible at web.local, or make sure that all machines in your virtual cluster can talk to each other with names like dev1.example.com, dev2.example.com and so on, no matter what IP address DHCP decided to give them.

This is all well and good, but there are a few issues. For instance, with virtualbox and dhcp, hostmanager only finds 127.0.0.1 as the IP for your virtuals, which is not handy, to say the least.

If you use aws, you would like the guests to use the private IP while the host uses the public IP, which is not possible by default.

Luckily, you can write your own IP resolver. The one I give here, which can also be found on github, addresses the following:

  • Linux guests only (probably other Unices as well to be honest, but I cannot guarantee that hostname -I works on all flavours)
  • dhcp and virtualbox
  • public/private IP with aws
$cached_addresses = {}
# There is a bug when using virtualbox/dhcp which makes hostmanager not find
# the proper IP, only the loopback one: https://github.com/smdahlen/vagrant-hostmanager/issues/86
# The following custom resolver (for linux guests) is a good workaround.
# Furthermore it handles aws private/public IP.

# A limitation (feature?) is that hostmanager only looks at the current provider.
# This means that if you `up` an aws vm, then a virtualbox vm, all aws ips
# will disappear from your host /etc/hosts.
# To prevent this, apply this patch to your hostmanager plugin (1.6.1), probably
# at $HOME/.vagrant.d/gems/gems or (hopefully) wait for newer versions.
# https://github.com/smdahlen/vagrant-hostmanager/pull/169
$ip_resolver = proc do |vm, resolving_vm|
  # For aws, we should use private IP on the guests, public IP on the host
  if vm.provider_name == :aws
    if resolving_vm.nil?
      used_name = vm.name.to_s + '--host'
    else
      used_name = vm.name.to_s + '--guest'
    end
  else
    used_name = vm.name.to_s
  end

  if $cached_addresses[used_name].nil?
    if hostname = (vm.ssh_info && vm.ssh_info[:host])

      # getting aws guest ip *for the host*, we want the public IP in that case.
      if vm.provider_name == :aws and resolving_vm.nil?
        vm.communicate.execute('curl -s http://169.254.169.254/latest/meta-data/public-ipv4') do |type, pubip|
          $cached_addresses[used_name] = pubip
        end
      else

        vm.communicate.execute('uname -o') do |type, uname|
          unless uname.downcase.include?('linux')
            warn("Guest for #{vm.name} (#{vm.provider_name}) is not Linux, hostmanager might not find an IP.")
          end
        end

        vm.communicate.execute('hostname --all-ip-addresses') do |type, hostname_i|
          # much easier (but less fun) to work in ruby than sed'ing or perl'ing from shell

          allips = hostname_i.strip().split(' ')
          if vm.provider_name == :virtualbox
            # 10.0.2.15 is the default virtualbox IP in NAT mode.
            allips = allips.select { |x| x != '10.0.2.15'}
          end

          if allips.size() == 0
            warn("Trying to find out ip for #{vm.name} (#{vm.provider_name}), found none useable: #{allips}.")
          else
            if allips.size() > 1
              warn("Trying to find out ip for #{vm.name} (#{vm.provider_name}), found too many: #{allips} and I cannot choose cleverly. Will select the first one.")
            end
            $cached_addresses[used_name] = allips[0]
          end
        end
      end
    end
  end
  $cached_addresses[used_name]
end

Just put this code in your Vagrantfile, outside the Vagrant.configure block, and use it by assigning it like this:

config.hostmanager.ip_resolver = $ip_resolver

On a side note, and as explained in the comment, there is a limitation (feature?) in that hostmanager only looks at the current provider. This means that if you up an aws vm, then a virtualbox vm, all aws ips will disappear from your host /etc/hosts.

To prevent this, apply this patch to your hostmanager plugin (1.6.1), probably at $HOME/.vagrant.d/gems/gems, or (hopefully) wait for newer versions. The patch itself is ridiculously tiny:

diff --git a/lib/vagrant-hostmanager/hosts_file/updater.rb b/lib/vagrant-hostmanager/hosts_file/updater.rb
index 9514508..ef469bf 100644
--- a/lib/vagrant-hostmanager/hosts_file/updater.rb
+++ b/lib/vagrant-hostmanager/hosts_file/updater.rb
@@ -82,8 +82,8 @@ module VagrantPlugins
 
         def update_content(file_content, resolving_machine, include_id)
           id = include_id ? " id: #{read_or_create_id}" : ""
-          header = "## vagrant-hostmanager-start#{id}\n"
-          footer = "## vagrant-hostmanager-end\n"
+          header = "## vagrant-hostmanager-start-#{@provider}#{id}\n"
+          footer = "## vagrant-hostmanager-end-#{@provider}\n"
           body = get_machines
             .map { |machine| get_hosts_file_entry(machine, resolving_machine) }
             .join

EMR – Elastic Map Reduce

Amazon has its own flavour of Hadoop, and this page explores in which cases it is worth using it instead of a usual Hadoop distribution on top of EC2.

What EMR is

Elastic Map Reduce is basically an Amazon-flavoured Hadoop distribution, patched and optimised to run on AWS, targeted towards one-off or very infrequent processing. It uses either Amazon’s own Hadoop or MapR.

Plus points

It is pretty easy to set up. Going to the EMR setup page, you just have a few knobs to click on to get a cluster up and running. Basically you choose whether you want Amazon or MapR, the set of applications to be bundled in, and the number and type of instances in your cluster. This hardly takes a minute and the cluster will automagically be provisioned for you.

It seems pretty much up to date; Spark 1.5, for instance, was available within a month of its release.

The cluster can be managed in different ways, via the web GUI, the CLI or the APIs, making it very flexible to scale in or out.

Minus points

The usual minus points of something which is managed for you apply. There is only a limited set of applications bundled in, namely Hadoop, Hive, Hue, Mahout, Oozie-Sandbox, Pig, Presto-Sandbox, Spark and Zeppelin-Sandbox. If you need another one, or a different version, you are out of luck. It is possible to do some manual installation or updates, but that probably defeats the purpose of paying extra to have a managed cluster.

Running costs are higher than running your own Hadoop cluster on EC2, as you have to pay not only for the EC2 servers but for EMR as well. EMR adds about 20-25% on top of the EC2 costs.

The default storage is S3, which is not meant for low-latency access. This might not be an issue for the use cases where EMR is really good, but can definitely become a problem if low latency is a must for you.

Interesting notes

You have the option, when setting a cluster up, to choose between a long-running or a transient life-cycle. This lets you spawn a cluster for very infrequent jobs, have them run, and destroy the cluster after completion (so you do not pay for it while idle).

Note that you cannot have more than 256 jobs (named steps) active at the same time. In older versions, 256 jobs was the total over the lifetime of the cluster.

Usage

It is really easy to submit a job. The storage is all in S3, so once:

  • your input data is in S3,
  • your job, consisting of a mapper and a reducer (jar or streaming in any language you wish), is in S3 as well,
  • you have created an output directory in S3,

you basically just have to fill these paths into a form and the job will run.
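The same can of course be scripted instead of going through the web form. A minimal boto3 sketch (cluster id, bucket and file names are placeholders):

# Sketch: submit a streaming step to an existing EMR cluster.
import boto3

emr = boto3.client('emr', region_name='eu-west-1')  # example region
emr.add_job_flow_steps(
    JobFlowId='j-XXXXXXXXXXXX',  # placeholder cluster id
    Steps=[{
        'Name': 'my streaming job',
        'ActionOnFailure': 'CONTINUE',
        'HadoopJarStep': {
            'Jar': 'command-runner.jar',
            'Args': [
                'hadoop-streaming',
                '-files', 's3://my-bucket/mapper.py,s3://my-bucket/reducer.py',
                '-mapper', 'mapper.py',
                '-reducer', 'reducer.py',
                '-input', 's3://my-bucket/input/',
                '-output', 's3://my-bucket/output/',
            ],
        },
    }],
)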

My experience is that, as expected, the latency is very high.

It is possible to chain steps, but you must then use AWS data pipeline, not covered here.

Summary

Basically, EMR would be great in 2 situations:

  • Very infrequent use of data without strong latency requirements. You can then spawn a transient cluster, have it do whatever processing you planned, and destroy it afterwards to save costs.
  • When the costs associated with managing a cluster would be higher than the extra EMR costs. This would probably be the case for short-term clusters, which reinforces the previous point.

Hortonworks, Cloudera or MapR?

This is one of the big questions when you start your first Hadoop project. Hadoop is Hadoop, right? So it should not matter which distribution you use? There is some truth in there, but there are still quite a few differences between these vendors, worth knowing about. After all, Linux is Linux, right? Debian or Redhat should not matter? You can jump straight to the quick answer, or carry on reading for more details.

Generalities

If you want to know more about Hadoop itself, you can check out the official Apache site, or just the wikipedia page for history and so on.

There are 3 big Hadoop distributions. Apache Hadoop itself, the root of them all, is not a distribution per se: you can download each component individually, but a lot of elbow grease is needed to tie everything together. The 3 main vendors bundle Apache Hadoop with other tools, open source as well as their own proprietary bricks, to create distributions. These are Cloudera, MapR and Hortonworks. There are other vendors as well, Microsoft (HDInsight, cloud only), Pivotal (Pivotal HD) and others I forget, but I concentrate on the big 3 here.

Quick answer

Use MapR if:

  • Performance is paramount,
  • You are a big company with strong audit requirements,
  • You know you will pay a licence for support.

Use Hortonworks if:

  • Open source is very important to you,
  • You do not want to pay for a licence but still want to do as much as possible (including security, authorisation),
  • You already have a data warehouse (Teradata, Oracle, Vertica…) that you plan to carry on using but would like to offload, or which does not allow all the processing you plan to do.

Use Cloudera if:

  • You need to be PCI compliant
  • You want as much as possible automated for you, at the potential cost of a licence

Longer answer and description

A generic comment first: if you already plan to use some specific tools or Linux distributions, make sure that they are compatible with the distribution and version you choose. For instance Tez does not run on Cloudera, Impala would have problems on Hortonworks, and MapR does not support Debian (but does support Ubuntu).

MapR

MapR's biggest differentiators are its filesystem and database, said to improve overall performance a lot because they are highly optimised and skip the jvm and ext4 layers, while still being compatible with the HDFS and HBase APIs. Their filesystem is a real filesystem, not append-only as HDFS is, and can be mounted via NFS, which makes some administration tasks much easier.

MapR strives to support the whole Hadoop ecosystem (for instance Tez, Impala, Spark…) which on paper means that more tools should be supported by MapR than by the other distributions.

MapR is the only one to support volumes, which can give you very strong security and multi-tenancy, as you can control with a very fine grain who can access which volume.

On the bad side, MapR is pretty limited in its free version. HA, for instance, is only available with a licence. (EDIT: see the comment from Anoop Dawar below, failover is now part of M3, the free version.)

As a nice starting point, you can spawn AWS instances preconfigured for MapR, where the cost includes licence and support, without having to commit for a year. The AWS images usually lag about 2 months behind the main MapR release due to extra testing and procedures.

Cloudera

Cloudera is the oldest Hadoop distribution. Their vision is to fully replace the warehouse by creating an Enterprise Data Hub, and to help the user a lot along the way.

The biggest strength of Cloudera is their automation. Cloudera Manager and Navigator are amazing tools doing a lot for you, and are said to be superior to their equivalents in other distributions. That said, they are closed source, and although the Manager is available for free, the Navigator (security, governance) is not.

Another very strong point of Cloudera is Impala, a very fast open-source SQL query engine for data stored in Hadoop.

Cloudera is the only PCI-compliant distribution.

Cloudera claims to have more Hadoop (and associated tools) committers on payroll than any other distribution.

Hortonworks

Hortonworks' vision is not to fully replace the warehouse, but to use your existing warehouse to provide offloading or new processes, thanks to integration with multiple partners.

Hortonworks is a fully open source distribution. There is no licence to pay, only support if you so wish. Hortonworks' definition of open source is very strict: for them, open source means managed by a committee, to avoid ‘dictatorial’ open source, where a project is technically open source but only one company can accept (and usually refuses) contributions.

Ambari is the management tool for Hortonworks. Although it is quite new and does not yet have all the features you would want from a manager, it is improving at great speed and is supported by multiple organisations, thanks to being open source.

Hortonworks supports Debian, but with an extra 1-month delay due to extra tests needed in comparison with the standard Redhat/CentOS version.

Hortonworks claims to have more Hadoop (and associated tools) committers on payroll than any other distribution.

Price comparison

This is always a big question, isn’t it? Here are a few prices I could gather. Those are just ballpark figures, and could of course be negotiated.

MapR support (24/7) is around 4k$/server/year. This goes up to 6k if you want to include MapRDB as well. This includes licence and support.

Cloudera support (24/7) is around 6.5k€/server/year. This includes support and licence. Note that Cloudera has multiple options, where you can elect to have full support (Enterprise), support for only one element (Flex) or support for only core Hadoop, i.e. HDFS, Hive and the like (Basic). Flex and Enterprise provide the Navigator, but Basic is very cheap (500€/server/year).

Hortonworks does not sell a licence as it is fully open source, but support (24/7) is priced at about 3.5k€/server/year.

Vendor lockin

This is usually a big concern, especially when talking about non-open-source tools. I would claim that it is a non-problem.

Your data is always available via standard tools, and that is what matters most. You will always be able to retrieve or export it in multiple ways. The rest (administration, basically) is tied to your distribution anyway. Whether you do everything with vanilla Apache and puppet, or use Ambari, Cloudera Manager or the MapR manager, it is not transferable to the other tools. In short, you are locked in – administration-wise – anyway.

Graph the noise level in your office in 15 minutes

This is a recurrent complaint in any open space: “There is too much noise!” (the other one is that the climate is too cold/too hot). There are some usual culprits, but it is nice to have data to back your complaints up.

I will show you here how to generate a real-time noise level graph in 15 minutes without any material besides your laptop or desktop; not even a microphone is needed. This is a dirty hack, but it works and can be put in place very quickly with just a few command lines. The steps, which are mostly cut & paste, are:

  • install a noise recorder tool
  • set up nginx to serve the data recorded
  • use a nice javascript library to display the data properly

I used soundmeter, a python tool. So first, install it:

# make sure we can install python packages
apt-get install virtualenv
# install required dependencies for building soundmeter
apt-get install python-dev portaudio19-dev alsa-utils
# install other useful tools for later display
apt-get install nginx expect-dev
# set up a directory for the tool
mkdir $HOME/soundmeter
# create virtualenv
virtualenv $HOME/soundmeter
# activate it
source $HOME/soundmeter/bin/activate
# install soundmeter
pip install soundmeter

Et voilà, your recorder is set up.

But do you not need a microphone? Well, either you have a laptop with a built-in microphone, or you can just plug in a headphone, which is basically a microphone used the other way around (producing instead of recording sound).

To get data, a one-liner is enough:

soundmeter --segment 2 --log /dev/stdout 2>/dev/null | unbuffer -p perl -p -e 's{\s*(\d+)\s+(.{19})(.*)}{"$2,". 20*log($1)/log(10)}e' > meter.csv

The explanation is as follow:

  • soundmeter: run soundmeter forever (--seconds to limit the duration)
  • --segment 2: output data every 2 seconds (the default of 0.5 seconds is very spiky)
  • --log /dev/stdout: the default output on stdout is not useful for graphing, we need the log output instead. Using /dev/stdout as the log file sends it to stdout
  • 2>/dev/null: do not pollute output
  • |: the output is not in a great format, it needs to be reformatted
  • unbuffer -p: by default data is buffered, which is annoying for real-time view. This does what the name suggests
  • perl -p -e: yummy, a perl regexp!
  • s///e: this will be a substitution, where the replacement part is a perl expression
  • \s*(\d+)\s+(.{19})(.*): capture the rms value and the timestamp (19 characters), discarding the milliseconds
  • "$2,": output the timestamp first, followed by a comma, for csv format
  • 20*log($1)/log(10): the values from soundmeter are rms values, transform them into dB via 20 * log10(rms) (perl's log is the natural logarithm, hence the division by log(10))
  • > meter.csv: save data in a file

In short, we do the following transformation on the fly and write it to a csv file:

2015-09-22 13:36:13,082 12 => 2015-09-22 13:36:13,21.5836249
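If perl is not your cup of tea, a rough python equivalent of the filter would be (same assumptions about the log layout as the regexp above):

# Sketch: read soundmeter log lines from stdin, write "timestamp,dB" csv lines.
import math
import re
import sys

# A value followed by a 19-character timestamp (milliseconds discarded),
# mirroring the perl expression above.
pattern = re.compile(r'\s*(\d+)\s+(.{19})')

for line in sys.stdin:
    match = pattern.search(line)
    if not match:
        continue
    rms, timestamp = int(match.group(1)), match.group(2)
    if rms > 0:
        print('{},{:.2f}'.format(timestamp, 20 * math.log10(rms)), flush=True)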

You now have a nice csv file. How to display it? Via a nice html page with the help of a javascript library, dygraphs, of course.

Set up nginx by adding in /etc/nginx/sites-enabled/noise the following content (replace YOUR_HOME by your actual home directory, of course):

server {
 listen 80;
 root YOUR_HOME/soundmeter;
}

and restart nginx:

service nginx restart

Then set up your page in $HOME/soundmeter/noise.html:

<html>
<head>
<script src="//cdnjs.cloudflare.com/ajax/libs/dygraph/1.1.1/dygraph-combined.js"></script>

<style>
#graphdiv2 { position: absolute; left: 50px; right: 10px; top: 50px; bottom: 10px; }
</style>

</head>
<body>
<div id="graphdiv2"></div>
<script type="text/javascript">
 g2 = new Dygraph(
 document.getElementById("graphdiv2"),
 "http://localhost/meter.csv", // path to CSV file
 {
 delimiter: ",",
 labels: ["Date", "Noise level"],
 title: ["Noise (in dB)"],
 showRoller: true,
 }
 );
</script>
</body>
</html>

You can of course replace localhost by your IP to publish this page to your colleagues.

Now just go to http://localhost/noise.html:

(Screenshot: the resulting real-time noise graph.)

Indispensable tool of the day: thefuck

Thefuck is a magnificent app which corrects your previous console command.

Thefuck is a python tool, hosted on github, which looks at your previous command and tries to correct it. You invoke it after a typo, of course, by typing fuck in your terminal.

A few examples in images, from the developer himself:

(Animated demo: thefuck in action.)

A few more examples:

Adds sudo for you:

~> apt-get install thefuck 
E: Could not open lock file /var/lib/dpkg/lock - open (13: Permission denied)
E: Unable to lock the administration directory (/var/lib/dpkg/), are you root?
~> fuck
sudo apt-get install thefuck [enter/↑/↓/ctrl+c]
Reading package lists... Done

Corrects your git branch syntax (this is my main use of thefuck):

~/gits/lomignet-apt> git push
fatal: The current branch newfeature has no upstream branch.
To push the current branch and set the remote as upstream, use

  git push --set-upstream origin newfeature

~/gits/lomignet-apt> fuck
git push --set-upstream origin newfeature [enter/↑/↓/ctrl+c]
Total 0 (delta 0), reused 0 (delta 0)
To git@github.com:lomignet/lomignet-apt.git
 * [new branch] newfeature -> newfeature
Branch newfeature set up to track remote branch newfeature from origin.

The installation is trivial as the package is uploaded to the python package index:

pip install thefuck

To then have the command available, add this to your .bashrc or whatever your startup script is:

eval $(thefuck --alias)

In the background, thefuck has a set of known rules which you can find in the readme. If those rules are not comprehensive enough for you, you can write your own.
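For example, a custom rule is just a small python file with a match and a get_new_command function, dropped in thefuck's rules directory (~/.config/thefuck/rules/ at the time of writing; the exact attributes of the command object may vary between versions, so double check the readme):

# ~/.config/thefuck/rules/git_pshu.py -- a hypothetical personal rule
# fixing the typo 'git pshu' into 'git push'.

def match(command):
    # Trigger only on this specific typo.
    return command.script.startswith('git pshu')

def get_new_command(command):
    return command.script.replace('git pshu', 'git push', 1)

# Optional: rules with a lower priority value are proposed first.
priority = 1000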