Replace a failed node in Vertica

Just a little semantic reminder before we dive in:

  • A host is a server on which Vertica is set up, but not necessarily used by a database. You can add hosts to a cluster to make them available to a database.
  • A node is a host that is part of a database.

If one of your nodes goes down while the server itself is still up (think data disk failure), the Vertica documentation says it is possible to replace it with another node with another IP address. This case never presented itself to me, so I will trust the documentation on that.

I did have the issue of a dead host, though, and in that case the documentation is not enough. As part of the process of replacing a node, you need to add a new host to the cluster. While doing this via the update_vertica utility, Vertica checks connectivity between all hosts of the cluster. As one host is down, the installation fails.

In that case, the solution is not trivial but quite straightforward, and the goal of this post is to explain it step by step.

  1. System preparation
  2. Update existing node info in the catalog via vsql
  3. Update admintools.conf
  4. Install Vertica on the new server
  5. Configure new node
  6. Restart new node

1 – System preparation

On the new server, make sure that Vertica is not installed and that /opt/vertica does not exist:

yum remove vertica
rm -rf /opt/vertica

2 – Update existing node info in the catalog via vsql

Connect to the database via vsql from any UP node, and run the statements below, where:

  • <failed_node_name>: the name of the node you want to replace, taken from the node_name column in v_catalog.nodes
  • <newip>: the IP address of the replacement host

-- change the node IP
ALTER NODE <failed_node_name> HOSTNAME '<newip>';
-- change the node's spread/control IP
ALTER NODE <failed_node_name> CONTROL HOSTNAME '<newip>';
-- re-write spread.conf with the new IP address and reload the running config
-- (the db should remain UP)
SELECT RELOAD_SPREAD(true);
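
To double check that the catalog took the change into account, you can run a quick sanity query; the new IP should now appear in the node_address column:

SELECT node_name, node_address, node_state FROM v_catalog.nodes;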

3 – Update admintools.conf

This must be done on an UP node. Any node will do, and in this post we will call it <source_host>.

Edit the file /opt/vertica/config/admintools.conf, replacing all instances of the old IP address with the new one. Three lines need updating:

  • [Cluster] > hosts
  • [Nodes]: 2 lines, the one starting with the node number (node0002) and the one starting with the full node name (v_<dbname>_node0002).

For instance, assume we are replacing node2 of a 3-node cluster, moving from IP 10.0.0.2 to IP 10.0.0.42.

Before the change, here were the relevant lines of admintools.conf:

[Cluster]
hosts = 10.0.0.1,10.0.0.2,10.0.0.3

[Nodes]
node0002 = 10.0.0.2,/home/dbadmin,/home/dbadmin
v_spil_dwh_node0002 = 10.0.0.2,/home/dbadmin,/home/dbadmin

After the change, notice the updated IP:

[Cluster]
hosts = 10.0.0.1,10.0.0.42,10.0.0.3

[Nodes]
node0002 = 10.0.0.42,/home/dbadmin,/home/dbadmin
v_spil_dwh_node0002 = 10.0.0.42,/home/dbadmin,/home/dbadmin
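
If you prefer not to edit the file by hand, a quick sed can do the replacement (a sketch; double check the result, as the old IP must not appear anywhere else in the file):

sed -i.bak 's/10\.0\.0\.2\b/10.0.0.42/g' /opt/vertica/config/admintools.conf
diff /opt/vertica/config/admintools.conf.bak /opt/vertica/config/admintools.conf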

4 – Install Vertica on the new server

On the same host used in the previous step, <source_host>, use update_vertica to add the new host to the cluster. The mandatory options are --rpm and the options you used at install time (which you can find in /opt/vertica/config/admintools.conf), or the path to your config file (--config-file/-z) if you used one.

Do NOT use the -S/--control-network, -A/--add-hosts or -R/--remove-hosts switches. You most likely will use -u/--dba-user, -g/--dba-group, -p/--dba-user-password and maybe a few more.

sudo /opt/vertica/sbin/update_vertica --rpm <complete path of RPM> -u <user> -g <group> ...

This script will verify all hosts, and will install the rpm on the new one.

5 – Configure new node

Log in as dbadmin (or whichever user is your database administrator) on the new node, and recreate the base and data directories as they were on the failed node. Assuming that:

  • the failed node was node2,
  • your database is named $dwh,
  • your base directory is /home/dbadmin,

Then create the following:

mkdir -p /home/dbadmin/$dwh
mkdir -p /home/dbadmin/$dwh/v_${dwh}_node0002_data
mkdir -p /home/dbadmin/$dwh/v_${dwh}_node0002_catalog

You can look at any UP node for an example of the expected hierarchy.

Still from the node where you edited the config file, <source_host>, distribute the configuration files via admintools:

  • run /opt/vertica/bin/admintools as the dbadmin user
  • go to Configuration Menu > Distribute Config Files
  • select Database Configuration and Admintools Meta-Data.

If you cannot find spread.conf under /home/dbadmin/$dwh/v_${dwh}_node0002_catalog, copy it from <source_host>. You can then check that spread.conf now has the IP 10.0.0.42 instead of 10.0.0.2.
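
For instance, assuming <source_host> is 10.0.0.1 and thus hosts node0001 (adapt the node names to your own cluster):

scp dbadmin@10.0.0.1:/home/dbadmin/$dwh/v_${dwh}_node0001_catalog/spread.conf \
    /home/dbadmin/$dwh/v_${dwh}_node0002_catalog/
grep 10.0.0.42 /home/dbadmin/$dwh/v_${dwh}_node0002_catalog/spread.conf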

Finally, as a last sanity check, have a look at /opt/vertica/config/admintools.conf and make sure that the new IP appears instead of the old one.

6 – Restart new node

Use admintools (/opt/vertica/bin/admintools) and select “Restart Vertica on Host”. The node will then start the recovery process. All missing data will be copied to it, and once done it will join the cluster, which will be complete again.

Vertica: move_partitions_to_table()

If you have a load in multiple steps (say data is first loaded into a staging schema, then moved to another one), using MOVE_PARTITIONS_TO_TABLE() is the best option. It is very fast, as you do not need to issue actual INSERT or DELETE statements or deal with delete vectors. Vertica moves a whole block of data in one go, and you are done.

When you call this function, Vertica will first issue a moveout to empty the WOS before moving data. This makes sense: if some data is still in memory, you want to make sure it is moved together with the rest of the partition already on disk.

Since at least Vertica 7.0.1-1, using this function might give you the error message:

select move_partitions_to_table ('stg.A', 0, 24, 'vault.A')
ERROR: A Moveout operation is already in progress on projection vault.B_node0002

As you can see, a moveout on table A fails because of a moveout on table B, which does not seem to make sense.

It has been confirmed that it indeed does not make sense: this is a bug in Vertica 7.0.1-1, which should in theory be fixed in the next service pack.

In the meantime, you can work around the bug by issuing INSERTs/DELETEs instead, or by having your code check whether a moveout is already running before issuing the move_partitions_to_table() statement.
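
Such a check could look like the following (a sketch; I believe the tuple_mover_operations system table exposes running moveouts, but do verify against your own version):

SELECT node_name, table_schema, table_name
FROM v_monitor.tuple_mover_operations
WHERE operation_name = 'Moveout'
  AND is_executing;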

EDIT: The hotfix 7.1.1-1 (11/2014) fixes the bug.

Avro end to end in hdfs – part 4: problems and solutions

This is a series of posts aiming at explaining how and why to set up compressed avro in hdfs. It will be divided into a few posts, more will be coming if relevant.

  1. Why avro?
  2. How to set up avro in flume
  3. How to use avro with hive
  4. Problems and solutions (This post)

Invalid/non standard schemas

The avro tools available for different languages are not all exactly equivalent. The default java one used in Hadoop, for instance, has issues when some fields can be set to null. Nested arrays are another issue in a lot of cases: the default java avro parser cannot handle them properly. Furthermore, even if you end up finding a way to generate avro files with nested arrays, some tools will not be able to read them. Hive will be fine, but Impala (as of version 1.2) is not.

I can only urge you to use simple schemas; this will make your life a lot easier.
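
As an illustration, here is the kind of flat schema I mean, with a nullable field declared as a union with null (a made-up example, not one of our actual schemas):

{
  "type": "record",
  "name": "LogLine",
  "fields": [
    {"name": "timestamp", "type": "long"},
    {"name": "message", "type": "string"},
    {"name": "source", "type": ["null", "string"], "default": null}
  ]
}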

Hive partitions and schema changes

If you use Hive partitions (and you should), all data in one specific partition must have the same schema. We used to have one partition per hour when loading some logs, but we are now also adding the avro schema version to the partition path. That way, data encoded in a new schema ends up in a different partition, even if it relates to the same hour. An example path is shown below.
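
For instance, a partition path including the schema version could look like this (a made-up layout, adapt to your own conventions):

/datain/logs/schema=v2/key=2014120812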

Faster encoding and flexibility

We started loading data the standard way, via flume. This created a lot of issues as explained earlier (nested arrays mostly), and flume was actually using a lot of resources. We ended up using the json2avro C tool, which is very fast and can handle nested arrays (although this bit us later because of impala). This tool generates avro files, which we load into hdfs via an hdfs fuse mount point. This improved performance drastically. Since we started using this fuse mount point, we have had no data loading issues or delays, whereas we had trouble every other week while using flume.

Default values

We started by writing a schema with default values, hoping that fields missing from the source JSON would simply get their defaults. Sadly, we ended up noticing that for avro, JSON is only a convenient representation of data, useful for debugging, and that default values do not make fields optional in it.

This means that a missing source field has to be represented explicitly, this way:

{"valid": {"boolean": true}, "source": null}

but a JSON document actually missing this field is not valid and will be rejected by the decoder:
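
{"valid": {"boolean": true}}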

The real cost of puppet

I spent 2 months puppetising a set of servers that were already used in production, instead of doing it as I went along, as it should have been done. After completing the puppetisation, I asked myself whether spending these 2 months actually added value to the business worth more than 2 months of my time. This is what I try to analyse here.

To avoid adding up apples and bananas, I am giving each item an actual economic ($$$) value. Finding out this value is actually the greatest part of the fun.

  1. What am I computing?
    1. What are the benefits associated with puppet?
    2. What are the costs associated with puppetising?
    3. Equation – first go
    4. Generalisation and simplification
    5. Benefit: Disaster recovery
    6. Benefit: satisfaction
  2. Time is money!
    1. My daily cost: Wages and related
    2. My daily added value
    3. Putting everything together
  3. Conclusion

What am I computing?

What are the benefits associated with puppet?

  • Knowledge sharing: when I quit or am fired (which happened to me way more often than being hit by a bus or a meteorite), everything I did is documented in the form of puppet manifests.
  • Scaling out: when I need to add a new server to a cluster, this can be done automatically instead of involving manual work.
  • Disaster recovery: if a server goes down and needs to be rebuilt, and if my puppet-fu is good enough, I just need to press a button to get a fully functional server back instead of spending days rebuilding it.
  • Ease of update: changing one puppet manifest and applying it everywhere is easier than updating a whole cluster of servers.
  • Satisfaction: most tasks are fun to do once but boring when you need to do them over and over again.

After I finished this post, a coworker gave me some very interesting feedback: another huge benefit is the ability to create new workflows. Combined with Openstack for instance, it is very easy to spawn a test instance to try an idea out. If it took days or even hours instead of minutes to have test servers ready, those ideas might never be tested. At this moment I am not sure how to measure that, so I will have to keep it out of the equation for now.

What are the costs associated with puppetising?

  • My wages and related expenses from my company.
  • Value I did not create while working on puppet.

Equation – first go

What I am trying to find out is whether the benefits are greater than the costs. Mathematically, I am trying to find the sign of (benefits – costs). If it is positive, i.e. greater than 0, I added value to the business. If not, maybe I should be fired.

benefits – costs =
(knowledge sharing + scaling out + disaster recovery + update + satisfaction) – (wages + value not created)

Generalisation and simplification

At the end of the day, most of those values can be mapped to time spent doing something. I will thus start by simplifying the equation, reducing as much as possible to days, and will find the value of my days from there. Some values are one-offs (knowledge sharing, puppetisation), some are recurrent. I am looking at one year worth of value, but am adding the Y variable to be able to look further into the future. Maybe I did not add enough value for next year, but enough if we look at the next 2 or 3 years. Likewise, it might be that our current number of servers does not make puppetising valuable, but a bigger number would make it worthwhile. That is why I introduced the S variable as well, S being the number of servers I manage.

  • Total time spent: 2 months are about 40 working days (counting out weekends, meetings, the occasional day off and so on).
  • Knowledge sharing: I estimate that it would take about 6 days to understand the servers without documentation and to get the details mostly correct.
  • Scaling out can be summed up as the time I would spend installing a server manually (0.5 days on average, with one outlier taking 2 days), multiplied by the number of times I need to install one of those servers. This simplifies to about 0.5 * 0.1 * S * Y, i.e. every year I increase our server farm by 10 percent. For instance, if I have 100 servers today, I expect to spend 5 days (10 servers @ 0.5 day per server) to scale out manually next year.
  • Update time is harder to compute, as ansible already helps a lot. It probably is not that relevant, but based on my experience puppet could save about 0.02 * S days a year. For instance, if I have 100 servers today, I expect to save 2 days a year with puppet compared to ansible.
  • My wages + value I create on a daily basis can be reduced to time spent (days), and will be computed later.

This gives us the next line of our equation:

benefits – costs
= (knowledge sharing + scaling out + disaster recovery + update + satisfaction) – (wages + value not created)
= (6 days + 0.5 * 0.1 * S * Y days + disaster recovery + 0.02 * S * Y days + satisfaction) – 40 days

Benefit: Disaster recovery

The three main questions here are:

  1. If this bit of the system goes down for one day, how much would it cost the business?
  2. What is the probability that this bit of the system goes down?
  3. How long would it take to restore said system?

There is one very easy way to answer the first question: just take said system down, hide for a day, and look at the reports the following day. Somehow, my coworkers were not thrilled with the idea. Luckily (or not), we had such an occurrence a few weeks back. No production server was impacted, but some predictive models and recommendations were not updated for 24 hours. This was the perfect data for me! It so happens that getting daily revenues is actually not so trivial, but I got my answer ($moneyloss). This is actually a lower bound, as the servers being down might have had ripple effects over a few days.

The probability that a bit of the system goes down can be computed from the MTBF. I will use an average MTBF of 50000 hours here. This is a common value for servers, but it does not take into account the rate at which data is written to disk, for instance. A disk receiving a lot of data before exporting it to Hadoop would have a much higher failure rate than a less used one. Likewise, I assume that the failure rate is linear and ignore the bathtub curve.

We already saw the cost of scaling out. I will estimate that disaster recovery costs twice this value, to account for the pressure, maybe finding new hardware, and so on.

What does that give us?

In the next year (Y), each server will fail (365*24)/50000 times. Multiply that by the number of servers (S) and we are almost there. One server is actually still a single point of failure, and would create an outage if it went down. This gives us: Y * (365*24)/50000 * S days + Y * (365*24)/50000 * $moneyloss (the money lost due to the SPoF):

benefits – costs
= (knowledge sharing + scaling out + disaster recovery + update + satisfaction) – (wages + value not created)
= (6 days + 0.5 * 0.1 * S * Y days + disaster recovery + 0.02 * S * Y days + satisfaction) – 40 days
= (6 days + 0.5 * 0.1 * S * Y days + ( S * Y * (365*24)/50000 days + Y * (365*24)/50000 * $moneyloss ) + 0.02 * S * Y days + satisfaction) – 40 days

Benefit: satisfaction

Based on a paper by Alex Edmans from the London Business School, The Link Between Job Satisfaction and Firm Value, With Implications for Corporate Social Responsibility, a happy employee performs about 3% better. This has been computed by comparing the growth of the best companies to work for versus the rest of the industry. Well, puppetising does not make me that happy, but I will assume that I was happy for those 2 months (1/6th of the year), so I performed 3% better during 1/6th of the year, bringing roughly 0.5% extra over the year, which I count as 0.05 * 40 days (2 days) below.

The equation becomes:

benefits – costs
= (knowledge sharing + scaling out + disaster recovery + update + satisfaction) – (wages + value not created)
= (6 days + 0.5 * 0.1 * S * Y days + disaster recovery + 0.02 * S * Y days + satisfaction) – 40 days
= (6 days + 0.5 * 0.1 * S * Y days + ( S * Y * (365*24)/50000 days + Y * (365*24)/50000 * $moneyloss ) + 0.02 * S * Y days + satisfaction) – 40 days
= (6 days + 0.5 * 0.1 * S * Y days + ( S * Y * (365*24)/50000 days + Y * (365*24)/50000 * $moneyloss ) + 0.02 * S * Y days + 0.05 * 40 days) – 40 days

We now have almost all values in. Simplifying, and rounding the coefficients up a bit, gives us:

benefits – costs
= (0.3 * S * Y – 32) days + Y * 0.2 * $moneyloss
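
To make this concrete: with S = 100 servers and Y = 1 year, the days part gives 0.3 * 100 * 1 – 32 = –2 days. In other words, at that scale the time saved barely covers the time invested in the first year, and the sign of the result hinges on the 0.2 * $moneyloss term.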

As expected, the result will depend on how far we look into the future and how many servers I am managing.

Time is money!

My daily cost: Wages and related

This is the easiest bit, although even this computation is not straightforward. The cost of an employee cannot be measured by their wages only. Think about hiring costs, space (rent) for the desk, secondary benefits (pension, training), management cost (if my team did not exist, my manager would be useless and let go), equipment (computer, phone) and so on. This of course depends on your company, your type of contract, your country of work or your type of work.

The cost of an employee is often described as their base wages multiplied by a factor, usually considered to be between 1.5 and 2.7. I will use a value of 2.

My daily added value

You can argue that I spent those 2 months in one go, instead of spreading them over time by puppetising everything when it should have been done. This is true, but what I am trying to work out is whether it was worth puppetising in the current context, meaning with everything already up and running in production.

So, which value do I bring to my company on a monthly basis?

This is always a hard question to answer for a backend guy. I cannot claim to have closed a sale that brought in $zillions, or created a model which generated a 25% uplift in revenues. Conversely, the sales guy or the analyst would not have been able to do their job without me doing mine (or so I want to believe at least). I thus went around asking various people for their opinion.

My manager answered that I above all bring knowledge, and that this is priceless. When I then argued that as I am priceless I should get a big raise, I sadly only got a resounding ‘no’. This makes me believe that I am, in fact, not priceless, and that my manager does not believe it either. Others answered that I am worth exactly what I negotiated as a salary package, because this is how the CFO sees me: this dwh guy in the corner with a price tag on his forehead. Another answer was that from a pure money viewpoint, we should only use on-demand consultants, brought in only when we need them. None of those answers put a value on my worth, only on what I cost, so I carried on asking.

The best theoretical answer I got was from an analyst. Basically, I ‘just’ need to create a model where the main KPI ($€£) is put in relation with my start date (and ideally my quitting date). Other employees are confounding factors which must be handled by the model. This is not easy and would be a project of its own, for which I already know that the value generated would not exceed the cost incurred.

I eventually used $moneyloss (the daily money lost if the servers I manage are down), and divided it by the number of people in the team, to give $myvalue.

Conversely, if I use the total revenue of my company divided by the total number of employees, I end up with a value 3 times as big. This means either that my team does not perform that well, or that we provide value in hidden ways, by helping other people make better decisions, for instance.

Putting everything together

benefits – costs
= (0.3 * S * Y – 32) days + Y * 0.2 * $moneyloss
= (0.3 * S * Y – 32) * (2 * $wages + $myvalue) + Y * 0.2 * $moneyloss

Now for the exciting part. Putting all the numbers together for one year, I end up with…

benefits < costs

Ah. Maybe I should be fired… Luckily, this is for year one only. It levels out in year 2, and starts generating value after that. This holds across the whole plausible range for $myvalue.

Conclusion

I did try to base all these computations on actual data. There are of course some assumptions, some rounding that should probably not be there (MTBF), some data that is very hard to compute (my daily value), some that would deserve better research (the cost of one day down). But the summary is that yes, puppet is valuable. My case was specific, as this was the first time I actually used puppet, so there was a big learning curve; if I had to do it again now I would be a lot faster.

I also learned a lot about my value, my cost, the complexity of actually finding them out, and how the company’s revenues react to mistakes I make. This was a real eye opener. It is of course no secret that knowing these numbers helps in making better decisions. If I had those numbers for every aspect of my company’s platform, prioritising would be a lot easier.

This was quite a fun post to write, which took way longer than expected as I discovered ramifications along the way. I can only encourage you to do the same, and think about what you would do if you found out that the value you bring is exactly 0€/day? Exactly equal to your wages? 10 times your wages? 1 million€/day?

The beginner’s Agile and Lean library

There is a lot said online about Agile and Lean. The distinction between the 2 would be a post of its own, but I would like to present here a few books which are a very good introduction. A lot of them are specifically geared towards software development, but most of the practices and principles can be adapted to other areas as well. The best example is that a lot of the lean principles actually come from the car manufacturing industry.

Suggested reads

Kanban: Lean from the Trenches, Henrik Kniberg. This is not a textbook, but a real world use of kanban by the Swedish police. It is a very easy read, with loads of pictures to help you visualise and understand what the book is about. Just by reading it, you will learn a lot about many aspects, practices and artefacts of lean and agile. If you read only one, I would suggest this one, as it does not delve too much into theory but follows a project from start to finish, which helps understand what agile and lean are about. After reading it, you will probably want to dig deeper into some parts, and the other books will become interesting.

Lean: An Agile Toolkit, Tom & Mary Poppendieck. A classic, this book will provide you with a set of tools and principles that constitute the core of lean: eliminate waste, decide as late as possible, empower the team… Even only keeping these principles in mind and forgetting the details will go a long way towards making your team, product or company better. If you want to know more about lean after having read Lean from the Trenches, I wholeheartedly recommend this one.

Scrum: The Elements of Scrum, Chris Sims & Hillary Louise Johnson. This book is a very easy read as well, specialising in Scrum. You will learn about the roles (product owner, scrum master…), artefacts (backlogs, charts) and practices (stand-up, demo…). I followed a Scrum Master training after having read this book, and although I still learned things during the training and enjoyed the practical aspects, the book had already taught me the majority of the theory.

Further reads

Hardcore theory: The Principles of Product Development Flow: Second Generation Lean Product Development, Donald G. Reinertsen. This book is just amazing. If you read it, I would strongly suggest buying a dead tree version, as I found myself going back and forth a lot between chapters to refresh my memory. This is the only book I own for which I bought a dead tree version after having bought the ebook. Basically, the goal of this book is to put a financial value on everything, which then allows you to compare practices using a common measurement. You can read this book at many levels. Again, the principles are great to remember (to name just one, the main one I believe: the cost of delay. For everything you do, ask yourself how much extra delay will cost the company), and for that alone this book will teach you a lot. If you are not scared of sentences like ‘The arrival queue in Kendall’s notation can be modelled by an M/M/1/∞ where M is a Markov process’, this book will move from great to fantastic.

About delaying commitment: Commitment – the book, Olav Maassen, Chris Matts, Chris Geary. A nice graphic novel specifically about the value of keeping your options open until the last responsible moment. You will follow a project from its impending doom to its shining success, thanks to good lean practices. On top of learning about keeping your options open, you will pick up a few other lean practices as well.

The Mythical Man-Month: Essays on Software Engineering, Frederick P. Brooks Jr. I have not read that one yet, but it is a much-referenced classic which I will read at some point. According to wikipedia, it is widely regarded as a classic on the human elements of software engineering.

Build a rpm for pentaho pdi (kettle)

To deploy pdi nicely at $work, I wanted to have it in our yum repositories. For this I used the fantastic fpm, the Effing Package Manager, which enables you to build RPMs without having to deal with complex spec files. In short, you tell it that you want to build an rpm from a directory, no other options are mandatory, and it just works (although a few options are nice to tweak).

If you use my startup script, you can even add it to the rpm. The final command looks something like:

fpm -s dir -t rpm \
  --name pentaho-pdi \
  --version 4.3.0 \
  --depends jdk \
  --vendor 'me@thisdwhguy.com' \
  --url 'https://github.com/pentaho/pentaho-kettle' \
  --description 'Pentaho pdi kettle' \
  --maintainer 'Me <me@thisdataguy.com>' \
  --license 'Apache 2.0' \
  --epoch 1 \
  --directories /opt/pentaho_pdi \
  --rpm-user pentaho \
  --rpm-group pentaho \
  --architecture all \
  --after-install=after-install.sh \
  ./pentaho_pdi=/opt \
  ./carte=/etc/init.d/carte

It will probably not be exactly what you want regarding paths and user. Furthermore, the after-install script needs to be generated (it just sets up ownership and rights of /etc/init.d/carte); the full script below takes care of that.

To make it easier, I created a small bash script with a few configuration variables and a few extra checks (mysql and vertica jars) which makes building very easy. You can just grab this script, remove the checks if they are irrelevant to you, and you should be good to go. The script will even install fpm for you if needed.

#!/bin/bash

if [ "x$1" == "x" ]; then
  echo "Need one parameter: the pentaho version string (eg 5.2.0.1)"
  exit 1;
else
  PDIVERSION=$1
fi

# name of the directory where pdi will be installed
PDIDIR=pentaho_pdi
# user to own the pdi files
PDIUSER=pentaho
# root of where pdi will be installed
PDIROOT=/opt


if ! which fpm 1>/dev/null 2>/dev/null; then
    echo "fpm is not installed. I will try to do it myself"
    echo "Installing relevant rpms..."
    sudo yum install -y ruby-devel gcc
    echo "Installing the fpm gem..."
    sudo gem install fpm
    if ! which fpm 1>/dev/null 2>/dev/null; then
        echo "failed installing fpm, please do it yourself: https://github.com/jordansissel/fpm"
        exit 1
    fi
else
    echo "fpm installed, good."
fi

if [ ! -d "$PDIDIR" ]; then
    echo "I expect a directory called $PDIDIR."
    echo "It is the 'dist' directory built from source renamed as $PDIDIR."
    echo "Look at https://github.com/pentaho/pentaho-kettle"
    exit 1
else
    echo "$PDI_DIR directory exists, good."
fi

ERRORS=0

find $PDIDIR -name \*mysql\*.jar | grep -qE '.*'
if [[ $? -ne 0 ]]; then
    echo  "Download the mysql jar from http://dev.mysql.com/downloads/connector/j/ and put it in the libext/JDBC (&lt;5.0) or lib (&gt;= 5.0) subdirectory of $PDIDIR."
    ERRORS=1
else
    echo "Mysql jar present in $PDIDIR, good."
fi

find $PDIDIR -name \*vertica\*.jar | grep -qE '.*'
if [[ $? -ne 0 ]]; then
    echo  "Get the vertica jar from /opt/vertica and put it in the libext/JDBC (&lt;5.0) or lib (&gt;= 5.0) subdirectory of $PDIDIR."
    ERRORS=1
else
    echo "Vertica jar present in $PDIDIR, good."
fi

if [[ $ERRORS -eq 1 ]]; then
    exit 1
fi

# the init.d script will be installed as $PDIUSER, whereas it should be root

# Check that carte init script exists, if yes add it to the options
if [ -f ./carte ]; then
(cat << EOC
#!/usr/bin/env sh
chown root:root /etc/init.d/carte
chmod 744 /etc/init.d/carte
chkconfig --add carte
EOC
) > ./after-install.sh
    echo "After install script for carte generated at after-install.sh"
    CARTEOPTIONS="--after-install=after-install.sh ./carte=/etc/init.d/carte"
else
    CARTEOPTIONS=""
    echo "No Carte init.d script present."
fi


# All good, let's build
echo "Build the effing rpm, removing existing rpms first..."
rm -f pentaho-pdi*rpm
fpm -s dir -t rpm \
  --name pentaho-pdi \
  --version $PDIVERSION \
  --depends jdk \
  --vendor 'me@thisdataguy.com' \
  --url 'https://github.com/pentaho/pentaho-kettle' \
  --description 'Pentaho pdi kettle' \
  --maintainer 'me@thisdataguy.com' \
  --license 'Apache 2.0' \
  --epoch 1 \
  --directories $PDIROOT/$PDIDIR \
  --rpm-user $PDIUSER \
  --rpm-group $PDIUSER \
  --architecture all $CARTEOPTIONS \
  ./$PDIDIR=$PDIROOT

rm -f after-install.sh

This will create a pentaho-pdi-${PDIVERSION}.noarch.rpm, which you can just yum install or put in your yum repositories.
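
For instance, assuming you saved the script as build_pdi_rpm.sh (name it whatever you like):

./build_pdi_rpm.sh 5.2.0.1
sudo yum localinstall pentaho-pdi-5.2.0.1*.noarch.rpm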

pentaho pdi (kettle) carte init.d script

The following script is an LSB-compliant init.d script allowing the carte webserver of pdi (kettle) to start at boot time.

Once the script is copied to /etc/init.d/carte, you can use it the usual way:

service carte start
service carte status
service carte stop
service carte restart

To actually have carte started at boot time, register the script with chkconfig for redhat flavours:

chkconfig --add carte

or use update-rc.d carte defaults on debian flavours:

update-rc.d carte defaults

Note that under Redhat it uses the /etc/rc.d/init.d/functions helper script, which gives nice colored output for [OK] or [FAILED]. If this script does not exist, a fallback is present, a bit less nice but working just as well. This means that in theory this script should work under all flavors of Linux.

There are a few configuration variables near the top of the script (the user to run carte as, where pdi is installed, and the port to listen on), but that is about it. You can find the script on github as well.

#!/bin/bash

# Start the carte server as a daemon, and helps managing it in a normal
# (service carte start/stop/status) way.

# Licence: FreeBSD, do what you want with it but do not hold me responsible.

### BEGIN INIT INFO
# Provides: pdi
# Required-Start: $network $syslog $remote_fs
# Required-Stop: $network $syslog
# Default-Start: 3 5
# Default-Stop: 0 1 2 4 6
# Short-Description: Pentaho Carte Server
# Description: Pentaho Carte Server
#
### END INIT INFO

## configuration directives
# Which user does carte run under?
PDIUSER=pentaho
# On which port should it listen?
CARTEPORT=80
# Where is pdi installed?
PDIROOT=/opt/pentaho_pdi
# Normal output log
LOGOUT=/var/log/pentaho_pdi.out.log
# Error output log
LOGERR=/var/log/pentaho_pdi.err.log

## script start here

# Note: The functions script is RH only. It is only used here for sexy (colored)
# output of [OK] or [FAILED] via echo_failure and echo_success.
#
# To make this script work under other flavors of linux, the 2 echo_ functions
# are first defined in a portable (but unsexy) way. If the RH functions script
# exists, its definition will override the portable way.
function echo_failure() { echo -en "\n[FAILED]"; }
function echo_success() { echo -en "\n[OK]"; }
[ -f /etc/rc.d/init.d/functions ] && source /etc/rc.d/init.d/functions

# Very useful for debugging
#set -x

# Find PID of the newest (-n) process owned by $PDIUSER (-u) with carte.sh on
# the full (-f) command, arguments included.
# => this should yield the pid of 'sh ./carte.sh' on STDOUT, with a status of 0
# if there is such a process, 1 otherwise
FINDPID="pgrep -u $PDIUSER -n -f carte.sh";
function _is_running() {
    $FINDPID 1>/dev/null
    return $?
}

function stop_carte() {
    _is_running
    if [ $? -ne 0 ]; then
        echo -n "$0 is not running, cannot stop."
        echo_failure
        echo
        return 1
    else
        echo -n "Stopping $0..."
        # Finding the pid of carte.sh from $FINDPID. Killing it would leave its
        # child, the actual java process, running.
        # Find this java process via ps and kill it.
        $FINDPID | xargs ps h -o pid --ppid | xargs kill
        sleep 1
        _is_running
        if [ $? -eq 0 ]; then
            echo_failure
            echo
            return 1
        else
            echo_success
            echo
            return 0
        fi
    fi

}

function status() {
    _is_running
    if [ $? -eq 0 ]; then
        echo -n "$0 is running."
        echo_success
        echo
        return 0
    else
        echo -n "$0 does not run."
        echo_failure
        echo
        return 1
    fi
}

function start_carte() {
    _is_running
    if [ $? -eq 0 ]; then
        echo -n "$0 already running."
        echo_failure
        echo
        return 1
    else
        echo -n "Starting $0..."
        # Make sure log files exist and are writable by $PDIUSER first
        touch $LOGOUT $LOGERR
        chown $PDIUSER:$PDIUSER $LOGOUT $LOGERR
        su - $PDIUSER -c "cd $PDIROOT && (nohup sh ./carte.sh $(hostname -i) $CARTEPORT 0<&- 1>>$LOGOUT 2>>$LOGERR &)"
        sleep 1
        _is_running
        if [ $? -eq 0 ]; then
            echo_success
            echo
            return 0
        else
            echo_failure
            echo
            return 1
        fi
    fi
}

case "$1" in
    start)
        start_carte
        exit $?
        ;;
    stop)
        stop_carte
        exit $?
        ;;
    reload|force-reload|restart|force-restart)
        stop_carte
        if [ $? -eq 0 ]; then
            start_carte
            exit $?
        else
            exit 1
        fi
        ;;
    status)
       status
       exit $?
       ;;
    *)
        echo "Usage: $0 {start|stop|restart|status}"
        exit 2
esac
exit 0

Vertica: rename multiple tables in one go

I have a use case where I need to regularly drop and fully recreate a table in Vertica. To keep the period without data to a minimum, I load data into an intermediate table, then rename the old table out of the way and rename the intermediate one to its final name. It turns out that Vertica allows both renames in one command, thus avoiding race conditions.

After loading, before renaming:

vertica=> select * from important;
 v
-----
 old
(1 row)

vertica=> select * from important_intermediate ;
 v
-----
 new
(1 row)

Multiple renaming:

ALTER TABLE important,important_intermediate RENAME TO important_old, important;

After:

vertica=> select * from important;
 v
-----
 new
(1 row)
vertica=> select * from important_old;
 v
-----
 old
(1 row)

You will probably want to drop important_old now:
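
DROP TABLE important_old;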

Avro end to end in hdfs – part 3: Hive

This is a series of posts aiming at explaining how and why to set up compressed avro in hdfs. It will be divided into a few posts, more will be coming if relevant.

  1. Why avro?
  2. How to set up avro in flume
  3. How to use avro with hive (this post)
  4. Problems and solutions

Use avro in Hive

Once your table is created and data is loaded, there is nothing extra to do: you can query it as you would any other table.

Create the table

Creating the table can be done as follows, with some comments:

-- table name
CREATE EXTERNAL TABLE IF NOT EXISTS table_name

-- Partition according to the end of the path you set in the flume sink (hdfs.path option).
-- Following the example from the previous post, we would have
PARTITIONED BY (key STRING)

-- Avro!
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'

-- Matches the first part of hdfs.path set up in the flume sink
-- Following the example of the previous post, we would have
LOCATION '/datain/logs'

-- Other options here are to hardcode the schema or use a file on your local filesystem instead.
TBLPROPERTIES ('avro.schema.url'='hdfs:///schemas/schema.avsc');

More information can be found in the cloudera documentation about hive and avro.
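
Note that the avro.schema.url property above assumes the schema file has already been uploaded to hdfs, for instance like this (paths are examples):

hdfs dfs -mkdir -p /schemas
hdfs dfs -put schema.avsc /schemas/schema.avsc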

Load the snappy jar

To load data, you need to tell Hive that the data files will be compressed, and Hive needs to know how to decompress them. For this, you need to add the snappy jar to the list of extra jars loaded by Hive, by adding its path to the value of the hive.aux.jars.path property in your hive-site.xml. For instance:

<property>
  <name>hive.aux.jars.path</name>
  <value>file:///usr/lib/hive/lib/hive-contrib.jar,...,file:///usr/lib/hive/lib/auxlib/snappy-java-1.0.4.1.jar</value>
</property>

Actually load data

You need to tell hive to use snappy, which is done the following way:

SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET mapred.output.compression.type=BLOCK;

Then loading data means creating a new partition whenever a new directory is created with another key. Run this after having told hive to use snappy:

ALTER TABLE table_name
ADD IF NOT EXISTS PARTITION (key='some_new_key')
LOCATION '/datain/logs/key=some_new_key';

Using data with the default schema

If you use a custom schema, tailored to your data, you can then enjoy the full speed of Hive, as not much parsing will be needed by Hive to access your data.

If you use the default schema, then Hive does not know (yet) about the columns in your table. This can be worked around with the decode() function. For instance:

SELECT
hour, decode(body,'UTF-8') as body
FROM my_table

Avro end to end in hdfs – part 2: Flume setup

This is a series of posts aiming at explaining how and why to set up compressed avro in hdfs. It will be divided into a few posts, more will be coming if relevant.

  1. Why avro?
  2. How to set up avro in flume (this post)
  3. How to use avro with hive
  4. Problems and solutions

Set up flume

Believe it or not, this is the easy part.

On the source, there is nothing specific to add, you can carry on as usual.

On the sink, here is a sample with comments:

agent.sinks.hdfs.type=hdfs
# Very important, *DO NOT* use CompressedStream. Avro itself will do the compression
agent.sinks.hdfs.hdfs.fileType=DataStream
# *MUST* be set to .avro for Hive to work
agent.sinks.hdfs.hdfs.fileSuffix=.avro
# Of course choose your own path
agent.sinks.hdfs.hdfs.path=hdfs://namenode/datain/logs/key=%{some_partition}
agent.sinks.hdfs.hdfs.writeFormat=Text
# The magic happens here:
agent.sinks.hdfs.serializer=avro_event
agent.sinks.hdfs.serializer.compressionCodec=snappy

Note the hdfs.path: “some_partition” might be a timestamp, for instance, which would create a new directory every hour. This will be used later in Hive.

Using this configuration will use the default Avro schema, which you can find defined in the flume source:

{
 "type": "record",
 "name": "Event",
 "fields": [{
   "name": "headers",
   "type": {
     "type": "map",
     "values": "string"
   }
 }, {
   "name": "body",
   "type": "bytes"
 }]
}

If you want to use your own custom schema, you need to extend AbstractAvroEventSerializer. This is not very complex; the default avro event serializer actually extends it already, hardcoding a schema, and is a good example to start from. You would typically put the schema at a place reachable by the sink, either hdfs itself or a URL. The path could be hardcoded in your class if you have one schema only, or could be passed as a flume header.
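
The custom serializer would then be wired into the sink in place of avro_event, via the fully qualified name of its Builder (a sketch; com.example.CustomEventSerializer is a hypothetical class name):

agent.sinks.hdfs.serializer=com.example.CustomEventSerializer$Builder
agent.sinks.hdfs.serializer.compressionCodec=snappy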

If, as in the example, you are using snappy, first make sure that snappy is installed:

# RedHat world:
yum install snappy
# Debian world:
apt-get install libsnappy1

And that’s really it, there is nothing more to do to use the default schema.