About (big) Kafka broker ids

I had quite a bit of fun setting up the Kafka broker id, and these are my findings, hoping to save some time for other poor souls like me.

TL;DR

Set up in your Kafka config:

  • nothing, to get auto-generated ids
  • broker.id=something_big AND reserved.broker.max.id=something_even_bigger to set your ids up manually.

Long Story

The broker id is a unique identifier per broker. Each broker in the cluster must have a different id, which is a positive int (meaning, for Java, at most 2147483647). This is all fine and dandy and works nicely if your ids increase from 1, 2…

Another option, nice for automated deployment, would be to generate ids based on the IP address, which should be unique within a DC and thus (probably) within a cluster. With Puppet, a nice Ruby expression in a template like:

broker.id=<%= @ipaddress.split('.').inject(0) {|total,value| (total << 8 ) + value.to_i} & 0x7FFFFFFF %>

would nicely do the job, generating a 31-bit int from the 32-bit IP (Java has no unsigned int, so we cannot use the full range) and discarding only the highest bit to keep as much variability as possible.
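
Just to sanity-check the arithmetic, here is the same computation in plain Python (an illustrative sketch, not part of any deployment):

# Pack the four octets of an IPv4 address into a 32-bit int,
# then mask to 31 bits so the result fits into a positive Java int.
def broker_id_from_ip(ip):
    packed = 0
    for octet in ip.split('.'):
        packed = (packed << 8) + int(octet)
    return packed & 0x7FFFFFFF

print(broker_id_from_ip('10.1.2.3'))     # 167838211
print(broker_id_from_ip('192.168.0.1'))  # 1084751873 (highest bit dropped)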

Now, it so happens that Kafka can generate its ids as well, from a ZooKeeper sequence. To make sure there is no collision, the auto-generated ids always stay above the undocumented reserved.broker.max.id value, which is 1000 by default.

Conversely, manual ids cannot be above this limit. If you dare set an id above it in your config file, Kafka will simply not start and, more annoyingly, will not give you any feedback beyond an exit code of 1. The solution, once you discover this configuration option, is easy: just set it as high as possible, for instance to the maximum int:

reserved.broker.max.id=2147483647

The hard part was finding out that this actually was the problem.

On a side note, changing the id after the first Kafka start is a very bad idea; you will end up with a message such as:

kafka.common.InconsistentBrokerIdException: Configured brokerId 999 doesn’t match stored brokerId 838 in meta.properties
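
If you wonder where that stored id comes from: it sits in a meta.properties file inside each of the directories listed in log.dirs, with content along these lines (the path below is just an example, yours depends on your config):

# e.g. /var/lib/kafka/data/meta.properties
version=0
broker.id=838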

EMR – Elastic Map Reduce

Amazon has its own flavour of Hadoop, and this page explores in which cases it is worth using it instead of a standard Hadoop distribution on top of EC2.

What EMR is

Elastic Map Reduce is basically an Amazon-flavoured Hadoop distribution, patched and optimised to run on AWS and targeted towards one-off or very infrequent processing. It uses either Amazon’s own Hadoop or MapR.

Plus points

It is pretty easy to set up. Going to the EMR setup page, you just have a few knobs to turn to get a cluster up and running. Basically, you choose whether you want Amazon’s Hadoop or MapR, the set of applications to be bundled in, and the number and type of instances in your cluster. This can be done in hardly a minute, and the cluster will be automagically provisioned for you.
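
The same setup can be scripted. As a rough sketch with boto3 (the names, region, instance types and release label below are just placeholders, and the default EMR IAM roles are assumed to exist already):

import boto3

# Spawn a small long-running cluster with Hadoop and Spark bundled in.
emr = boto3.client('emr', region_name='eu-west-1')
response = emr.run_job_flow(
    Name='demo-cluster',
    ReleaseLabel='emr-4.1.0',  # the EMR release bundling the applications
    Applications=[{'Name': 'Hadoop'}, {'Name': 'Spark'}],
    Instances={
        'MasterInstanceType': 'm3.xlarge',
        'SlaveInstanceType': 'm3.xlarge',
        'InstanceCount': 3,
        'KeepJobFlowAliveWhenNoSteps': True,  # long-running life-cycle
    },
    JobFlowRole='EMR_EC2_DefaultRole',
    ServiceRole='EMR_DefaultRole',
)
print(response['JobFlowId'])  # the cluster id, used in the later calls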

It seems pretty much up to date: Spark 1.5, for instance, was available within a month of its release.

The cluster can be managed in different ways, via the web GUI, the command line or the APIs, making it very flexible to scale in or out.
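
Scaling out through the API, for instance, boils down to a single call (a boto3 sketch again; the instance group id is a placeholder you would look up first with list_instance_groups):

import boto3

emr = boto3.client('emr', region_name='eu-west-1')
# Grow the core instance group of a running cluster to 5 instances.
emr.modify_instance_groups(
    InstanceGroups=[{'InstanceGroupId': 'ig-XXXXXXXXXXXX', 'InstanceCount': 5}],
)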

Minus points

The usual minus points of something which is managed for you apply. There is only a limited set of applications bundled in, namely Hadoop, Hive, Hue, Mahout, Oozie-Sandbox, Pig, Presto-Sandbox, Spark and Zeppelin-Sandbox. If you need another one, or a different version, you are out of luck. It is possible to do some manual installation or updates, but that probably defeats the purpose of paying extra to have a managed cluster.

Running costs are higher than running your own Hadoop cluster on EC2, as you have to pay not only for the EC2 servers but for EMR as well. The EMR premium is about 20-25% on top of the EC2 costs.

The default storage is S3, which is not meant for low-latency access. This might not be an issue for the use cases where EMR is really good, but can definitely become a problem if low latency is a must for you.

Interesting notes

You have the option, when setting a cluster up, to choose between a long-running and a transient life-cycle. This lets you spawn a cluster for very infrequent jobs, have them run, and destroy the cluster after completion (so you are not paying for it while idle).
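
Destroying the cluster once the processing is done is itself a single API call (a boto3 sketch, the cluster id being a placeholder):

import boto3

emr = boto3.client('emr', region_name='eu-west-1')
# Tear the cluster down so it stops costing money while idle.
emr.terminate_job_flows(JobFlowIds=['j-XXXXXXXXXXXX'])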

Note that you cannot have more than 256 jobs (called steps) active at the same time. In older versions, 256 was the total number of jobs over the lifetime of the cluster.

Usage

It is really easy to submit a job. The storage is all in S3, so once

  • your input data is in S3,
  • your job, consisting of a mapper and a reducer (a jar, or streaming in any language you wish), is in S3,
  • you have created an output directory in S3,

you basically just have to fill these paths into a form and the job will run.
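
For the record, the same submission through the API looks roughly like this (a boto3 sketch; the cluster id, bucket and script names are hypothetical):

import boto3

emr = boto3.client('emr', region_name='eu-west-1')
# Submit one Hadoop streaming step: the mapper and reducer scripts plus the
# input and output S3 paths, i.e. exactly the fields the web form asks for.
emr.add_job_flow_steps(
    JobFlowId='j-XXXXXXXXXXXX',  # the cluster id
    Steps=[{
        'Name': 'wordcount',
        'ActionOnFailure': 'CONTINUE',
        'HadoopJarStep': {
            'Jar': 'command-runner.jar',
            'Args': [
                'hadoop-streaming',
                '-files', 's3://my-bucket/jobs/mapper.py,s3://my-bucket/jobs/reducer.py',
                '-mapper', 'mapper.py',
                '-reducer', 'reducer.py',
                '-input', 's3://my-bucket/input/',
                '-output', 's3://my-bucket/output/run-1/',
            ],
        },
    }],
)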

My experience is that, as expected, the latency is very high.

It is possible to chain steps, but you must then use AWS Data Pipeline, which is not covered here.

Summary

Basically, EMR would be great in two situations:

  • Very infrequent use of data without strong latency requirements. You can then spawn a transient cluster, have it run whatever processing you planned, and destroy it afterwards to save costs.
  • When the costs of managing a cluster yourself would be higher than the extra EMR costs. This would probably be the case for short-term clusters, which reinforces the previous point.