Build an rpm for pentaho pdi (kettle)

To deploy pdi cleanly at $work, I wanted to have it in our yum repositories. For this I used the fantastic fpm, the Effing Package Manager, which lets you build an rpm without having to deal with complex spec files. In short, you tell it that you want to build an rpm from a directory and it just works; almost no other options are mandatory, but a few are nice to tweak.

If you use my startup script, you can even add it to the rpm. The final command is something like:

fpm -s dir -t rpm \
  --name pentaho-pdi \
  --version 4.3.0 \
  --depends jdk \
  --vendor 'me@thisdataguy.com' \
  --url 'https://github.com/pentaho/pentaho-kettle' \
  --description 'Pentaho pdi kettle' \
  --maintainer 'Me <me@thisdataguy.com>' \
  --license 'Apache 2.0' \
  --epoch 1 \
  --directories /opt/pentaho_pdi \
  --rpm-user pentaho \
  --rpm-group pentaho \
  --architecture all \
  --after-install=after-install.sh \
  ./pentaho_pdi=/opt \
  ./carte=/etc/init.d/carte

It will probably not be exactly what you want regarding paths and user. Furthermore, the after-install script needs to be generated (it just sets the ownership and permissions of /etc/init.d/carte).

To make it easier, I created a small bash script with a few configuration variables and a few extra checks (mysql and vertica jars) which makes building very easy. You can just get this script, remove the checks if they are irrelevant to you, and you should be good to go. The script will even install fpm for you if needed.

#!/bin/bash

if [ "x$1" == "x" ]; then
  echo "Need one parameter: the pentaho version string (eg 5.2.0.1)"
  exit 1;
else
  PDIVERSION=$1
fi

# name of the directory where pdi will be installed
PDIDIR=pentaho_pdi
# user to own the pdi files
PDIUSER=pentaho
# root of where pdi will be installed
PDIROOT=/opt


if ! which fpm 1>/dev/null 2>/dev/null; then
    echo "fpm is not installed. I will try to do it myself"
    echo "Installing relevant rpms..."
    sudo yum install -y ruby-devel gcc
    echo "Installing the fpm gem..."
    sudo gem install fpm
    if ! which fpm 1>/dev/null 2>/dev/null; then
        echo "failed installing fpm, please do it yourself: https://github.com/jordansissel/fpm"
        exit 1
    fi
else
    echo "fpm installed, good."
fi

if [ ! -d "$PDIDIR" ]; then
    echo "I expect a directory called $PDIDIR."
    echo "It is the 'dist' directory built from source renamed as $PDIDIR."
    echo "Look at https://github.com/pentaho/pentaho-kettle"
    exit 1
else
    echo "$PDIDIR directory exists, good."
fi

ERRORS=0

find $PDIDIR -name \*mysql\*.jar | grep -qE '.*'
if [[ $? -ne 0 ]]; then
    echo  "Download the mysql jar from http://dev.mysql.com/downloads/connector/j/ and put it in the libext/JDBC (<5.0) or lib (>= 5.0) subdirectory of $PDIDIR."
    ERRORS=1
else
    echo "Mysql jar present in $PDIDIR, good."
fi

find $PDIDIR -name \*vertica\*.jar | grep -qE '.*'
if [[ $? -ne 0 ]]; then
    echo  "Get the vertica jar from /opt/vertica and put it in the libext/JDBC (<5.0) or lib (>= 5.0) subdirectory of $PDIDIR."
    ERRORS=1
else
    echo "Vertica jar present in $PDIDIR, good."
fi

if [[ $ERRORS -eq 1 ]]; then
    exit 1
fi

# the init.d script will be installed as $PDIUSER, whereas it should be root

# Check that carte init script exists, if yes add it to the options
if [ -f ./carte ]; then
(cat << EOC
#!/usr/bin/env sh
chown root:root /etc/init.d/carte
chmod 744 /etc/init.d/carte
chkconfig --add carte
EOC
) > ./after-install.sh
    echo "After install script for carte generated at after-install.sh"
    CARTEOPTIONS="--after-install=after-install.sh ./carte=/etc/init.d/carte"
else
    CARTEOPTIONS=""
    echo "No Carte init.d script present."
fi


# All good, let's build
echo "Build the effing rpm, removing existing rpms first..."
rm -f pentaho-pdi*rpm
fpm -s dir -t rpm \
  --name pentaho-pdi \
  --version $PDIVERSION \
  --depends jdk \
  --vendor 'me@thisdataguy.com' \
  --url 'https://github.com/pentaho/pentaho-kettle' \
  --description 'Pentaho pdi kettle' \
  --maintainer 'me@thisdataguy.com' \
  --license 'Apache 2.0' \
  --epoch 1 \
  --directories $PDIROOT/$PDIDIR \
  --rpm-user $PDIUSER \
  --rpm-group $PDIUSER \
  --architecture all $CARTEOPTIONS \
  "./$PDIDIR=$PDIROOT"

rm -f after-install.sh

This will create an rpm named something like pentaho-pdi-${PDIVERSION}.noarch.rpm (the exact name may include an iteration suffix such as -1), which you can install directly with yum or put in your yum repositories.
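Before pushing the package to a repository, it can be worth checking what ended up in it. The following is only a sketch, assuming the rpm sits in the current directory and that the rpm command is available; the helper name latest_pdi_rpm is mine and is not part of the build script.

```bash
#!/bin/bash
# Hypothetical helper: print the most recently modified pentaho-pdi rpm
# found in the given directory (default: current directory).
latest_pdi_rpm() {
    local dir="${1:-.}" f newest=""
    for f in "$dir"/pentaho-pdi*.rpm; do
        [ -e "$f" ] || continue          # the glob did not match anything
        if [ -z "$newest" ] || [ "$f" -nt "$newest" ]; then
            newest="$f"
        fi
    done
    echo "$newest"
}

RPM="$(latest_pdi_rpm .)"
if [ -n "$RPM" ] && command -v rpm >/dev/null 2>&1; then
    rpm -qpi "$RPM"               # package metadata (name, version, vendor...)
    rpm -qpl "$RPM" | head -n 20  # first files of the payload
    rpm -qp --scripts "$RPM"      # the after-install script, if any
else
    echo "No pentaho-pdi rpm found here, or rpm is not available."
fi
```

If the output looks right, the file can then be installed locally or copied to your repository.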

pentaho pdi (kettle) carte init.d script

The following script is an LSB compliant init.d script allowing the carte webserver of pdi (kettle) to start at boot time.

Once the script is copied to /etc/init.d/carte, you can use it the usual way:

service carte start
service carte status
service carte stop
service carte restart
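To check that Carte itself answers, and not only that the process exists, you can query its status page. This sketch assumes the default Carte account (cluster/cluster) and the /kettle/status/ page; the host, the port and the carte_status_url helper are placeholders of mine, to adapt to your setup.

```bash
#!/bin/bash
# Hypothetical helper: build the URL of the Carte status page.
carte_status_url() {
    local host="$1" port="$2"
    echo "http://${host}:${port}/kettle/status/"
}

URL="$(carte_status_url localhost 80)"
# -s: quiet, -f: fail on HTTP errors, -m 5: give up after 5 seconds,
# -u: basic auth with the (assumed) default Carte account
if command -v curl >/dev/null 2>&1 \
    && curl -sf -m 5 -u cluster:cluster "$URL" >/dev/null; then
    echo "Carte answers at $URL"
else
    echo "No answer from $URL (is the service started?)"
fi
```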

To actually have carte started at boot time, register the script with chkconfig for redhat flavours:

chkconfig --add carte

or with update-rc.d on debian flavours:

update-rc.d carte defaults
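If the same setup has to work on both families, the choice between the two commands can be automated. A minimal sketch, assuming that the presence of one of the two tools is enough to tell the flavours apart; the function name boot_register_cmd is hypothetical.

```bash
#!/bin/bash
# Hypothetical helper: print the command that registers an init.d script
# at boot, depending on which tool is installed.
boot_register_cmd() {
    local service="$1"
    if command -v chkconfig >/dev/null 2>&1; then
        echo "chkconfig --add $service"        # redhat flavours
    elif command -v update-rc.d >/dev/null 2>&1; then
        echo "update-rc.d $service defaults"   # debian flavours
    else
        return 1                               # neither tool found
    fi
}

if CMD="$(boot_register_cmd carte)"; then
    echo "Run as root: $CMD"
else
    echo "Neither chkconfig nor update-rc.d found."
fi
```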

Note that under Redhat it uses the ‘/etc/rc.d/init.d/functions’ helper script which gives out nice colored output for [OK] or [FAILED]. If this script does not exist, a fallback is present, a bit less nice but working just as well. This means that in theory this script should work under all flavors of Linux.

There are a few configuration variables near the top of the script (user to run carte under, path to carte and port to listen to), but that is about it. You can find the script on github as well.

#!/bin/bash

# Start the carte server as a daemon, and helps managing it in a normal
# (service carte start/stop/status) way.

# Licence: FreeBSD, do what you want with it but do not hold me responsible.

### BEGIN INIT INFO
# Provides: pdi
# Required-Start: $network $syslog $remote_fs
# Required-Stop: $network $syslog
# Default-Start: 3 5
# Default-Stop: 0 1 2 4 6
# Short-Description: Pentaho Carte Server
# Description: Pentaho Carte Server
#
### END INIT INFO

## configuration directives
# Which user does carte run under?
PDIUSER=pentaho
# On which port should it listen?
CARTEPORT=80
# Where is pdi installed?
PDIROOT=/opt/pentaho_pdi
# Normal output log
LOGOUT=/var/log/pentaho_pdi.out.log
# Error output log
LOGERR=/var/log/pentaho_pdi.err.log

## script start here

# Note: The functions script is RH only. It is only used here for sexy (colored)
# output of [OK] or [FAILED] via echo_failure and echo_success.
#
# To make this script work under other flavors of linux, the 2 echo_ functions
# are first defined in a portable (but unsexy) way. If the RH functions script
# exists, its definition will override the portable way.
function echo_failure() { echo -en "\n[FAILED]"; }
function echo_success() { echo -en "\n[OK]"; }
[ -f /etc/rc.d/init.d/functions ] && source /etc/rc.d/init.d/functions

# Very useful for debugging
#set -x

# Find PID of the newest (-n) process owned by $PDIUSER (-u) with carte.sh on
# the full (-f) command, arguments included.
# => this should yield the pid of 'sh ./carte.sh' on STDOUT, with a status of 0
# if there is such a process, 1 otherwise
FINDPID="pgrep -u $PDIUSER -n -f carte.sh";
function _is_running() {
    $FINDPID 1>/dev/null
    return $?
}

function stop_carte() {
    _is_running
    if [ $? -ne 0 ]; then
        echo -n "$0 is not running, cannot stop."
        echo_failure
        echo
        return 1
    else
        echo -n "Stopping $0..."
        # Finding the pid of carte.sh from $FINDPID. Killing it would leave its
        # child, the actual java process, running.
        # Find this java process via ps and kill it.
        $FINDPID | xargs ps h -o pid --ppid | xargs kill
        sleep 1
        _is_running
        if [ $? -eq 0 ]; then
            echo_failure
            echo
            return 1
        else
            echo_success
            echo
            return 0
        fi
    fi

}

function status() {
    _is_running
    if [ $? -eq 0 ]; then
        echo -n "$0 is running."
        echo_success
        echo
        return 0
    else
        echo -n "$0 does not run."
        echo_failure
        echo
        return 1
    fi
}

function start_carte() {
    _is_running
    if [ $? -eq 0 ]; then
        echo -n "$0 already running."
        echo_failure
        echo
        return 1
    else
        echo -n "Starting $0..."
        # Make sure log files exist and are writable by $PDIUSER first
        touch $LOGOUT $LOGERR
        chown $PDIUSER:$PDIUSER $LOGOUT $LOGERR
        su - $PDIUSER -c "cd $PDIROOT && (nohup sh ./carte.sh $(hostname -i) $CARTEPORT 0<&- 1>>$LOGOUT 2>>$LOGERR &)"
        sleep 1
        _is_running
        if [ $? -eq 0 ]; then
            echo_success
            echo
            return 0
        else
            echo_failure
            echo
            return 1
        fi
    fi
}

case "$1" in
    start)
        start_carte
        exit $?
        ;;
    stop)
        stop_carte
        exit $?
        ;;
    reload|force-reload|restart|force-restart)
        stop_carte
        if [ $? -eq 0 ]; then
            start_carte
            exit $?
        else
            exit 1
        fi
        ;;
    status)
       status
       exit $?
       ;;
    *)
        echo "Usage: $0 {start|stop|restart|status}"
        exit 2
esac
exit 0

Clean pentaho shared connections from transformations and jobs

Pentaho has this nice shared.xml file, which can be found in your $HOME/.kettle directory. You can define all your connections there, which in theory avoids duplicating connection definitions in every job and gives you a single place to update your connections when needed.

The sad reality is that each time you save a job or a transformation, the connections are still embedded in it, effectively duplicating them. If you somehow remove the connection details from your job or transformation, the ones from shared.xml will be used, which is what we want.

This ‘somehow’ can easily be achieved by the following snippet:

find . -type f -print0 | xargs -0 perl -0 -p -i -e 's/\s*<connection>\s*<.*?<\/connection>\s*$//smg'

We run it regularly on our codebase to keep it clean, and it has always worked as expected.

Pentaho kettle: how to remotely execute a job with a file repository

Pentaho/kettle background

Kettle (now known as pdi) is a great ETL tool, open source with a paid enterprise edition if you need extra support or plugins.

One great feature is the ability to remotely execute one of your jobs for testing, without having to deploy anything. This is done via the carte server (part of pdi), which basically is a service listening on a port to which you send your jobs.

Carte background

Carte works very well when you are using a database repository, but you will run into issues when you use a file repository. The reason is that when you run a job remotely, kettle needs to bundle all the relevant jobs and transformations to send them over. This is not always possible; an obvious example is when some job names are parametrised.

There is still a way to deal with this. Carte’s behaviour is to use the jobs/transformations sent by kettle, or to use the one it can find locally if the repository names match.

The solution

The solution is then quite logical: copy over your current repository to the carte server, set it up with the same name as your local repository and you are good to go.

This is a bit painful to do manually, so here is the job I wrote to do it automatically from inside pentaho. It makes few assumptions, except that you can copy files to your carte server with scp (you thus need ssh access).

The flow is as follows:

  1. Delete existing compressed local repository if any
  2. Compress local repository
  3. Delete remote compressed repository if any
  4. Copy over compressed local repository
  5. Uncompress remote compressed repository

You can see this in the following picture (the red arrows show the inside of the transformations):

Copy a local repository

To be generic, a few configuration values must be added to your kettle.properties. They set up the remote server name, your username, various paths. The following is an example with comments for all fields.

# Hostname of your etl server where carte runs
ssh.etlhost=etlserver.example.com
# Name of your ssh user
ssh.etluser=thisdwhguy
# Use one of ssh.password or ssh.keypath + ssh.keypass
# password of your ssh user, leave empty if none
ssh.password=
# Where is your private key on your local machine
ssh.keypath=/Users/thisdwhguy/.ssh/id_rsa
# If your private key is password protected, add it here.
# If not, leave it empty
ssh.keypass=
# Where does your repo sit
local.repo=/Users/thisdwhguy/pentaho
# Where to compress locally the repository
zip.tmpdir=/Users/thisdwhguy/tmp
# What is the name of your compressed repository
# (can be anything, this is irrelevant but having
# it here keeps consistency)
zip.devclone=devclone.zip
# Where to uncompress the zip file? This setup
# allows multiple users and the final directory
# will be ${ssh.etltargetdir}/${ssh.etltargetuserrepo}
ssh.etltargetdir=/path/to/repositories
ssh.etltargetuserrepo=thisdwhguy
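
For readers who want to reproduce the flow outside of pentaho, the five steps can be sketched in plain shell with zip, scp and ssh. This is not the original job, only an approximation under assumptions: key-based ssh access, zip available locally and unzip on the server, and variable names of mine that mirror the properties above. With DRY_RUN=1 (the default here) the commands are only printed.

```bash
#!/bin/bash
# Values mirroring the kettle.properties example (all placeholders).
ETLHOST=etlserver.example.com
ETLUSER=thisdwhguy
KEYPATH="$HOME/.ssh/id_rsa"
LOCAL_REPO="$HOME/pentaho"
ZIP_TMPDIR="${TMPDIR:-/tmp}"
ZIP_NAME=devclone.zip
TARGET_DIR=/path/to/repositories
TARGET_REPO=thisdwhguy
DRY_RUN="${DRY_RUN:-1}"

# Print the command, and run it unless DRY_RUN is 1.
run() {
    echo "+ $*"
    [ "$DRY_RUN" = "1" ] || "$@"
}

sync_repo() {
    local zipfile="$ZIP_TMPDIR/$ZIP_NAME"
    local remote="$ETLUSER@$ETLHOST"
    # 1. Delete the existing local archive, if any
    run rm -f "$zipfile"
    # 2. Compress the local repository (from inside it, so paths are relative)
    run sh -c "cd '$LOCAL_REPO' && zip -qr '$zipfile' ."
    # 3. Delete the remote archive, if any
    run ssh -i "$KEYPATH" "$remote" "rm -f '$TARGET_DIR/$ZIP_NAME'"
    # 4. Copy the local archive over
    run scp -i "$KEYPATH" "$zipfile" "$remote:$TARGET_DIR/$ZIP_NAME"
    # 5. Uncompress it into the per-user repository directory
    run ssh -i "$KEYPATH" "$remote" \
        "unzip -qo '$TARGET_DIR/$ZIP_NAME' -d '$TARGET_DIR/$TARGET_REPO'"
}

sync_repo
```

Set DRY_RUN=0 once the printed commands look right. Note that unzip -o overwrites existing files but does not remove files that were deleted locally, so you may prefer to empty the target directory first.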

Caveats

This job assumes that you have ssh access. If this is the case, you can use this job as is, but there is one thing you might want to update.

I assumed that a key is used for ssh, but a password might be the only thing you need. If that is the case, update the 2 ssh steps and the copy step accordingly by unticking ‘use private key’.

That’s all, folks

This job should be quite easy to use. Do not hesitate to comment if you have questions.

Sadly I cannot attach a zip file to this post, and after doing some over-enthusiastic cleaning I lost the example file completely. I hope that the description given in this post is enough.

Official fix

It looks like this workaround is no longer needed, thanks to the bug fix PDI-13774, available in version 5.4.0 GA.