New puppet apt module, now with better hiera!

The puppet apt module from puppetlabs works great, but has one big issue. You can define sources (and keys, settings and ppas) from hiera, but only your most specific definition will be used by the module, as only the default priority lookup is done. This means a lot of copy & paste if you want to manage apt settings across multiple environments or server roles. This is a known issue, but it will not be fixed as it is apparently by design.

Well, this design did not suit me, so I forked the puppetlabs module, updated it to do a proper hiera_hash lookup, and published it to the puppet forge. There is no other difference with the original, but it does simplify my life a lot. Now if you define multiple sources in your hierarchy, for instance at datacenter level:

apt::sources:
  'localRepo':
    location: 'https://repo.example.com/debian'
    release: '%{::lsbdistcodename}'
    repos: 'main contrib non-free'

and at server level:

apt::sources:
  'puppetlabs':
    location: 'http://apt.puppetlabs.com'
    repos: 'main'
    key:
      id: '47B320EB4C7C375AA9DAE1A01054B7A24BD6EC30'
      server: 'pgp.mit.edu'

you will nicely have both sources in sources.list.d, instead of only having the one defined at server level.
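For reference, a hierarchy along these lines in hiera.yaml would drive the example above (the level names and paths here are made up for illustration, adapt them to your setup):

```yaml
# Sketch of a Hiera 3 style configuration; names and datadir are examples.
:backends:
  - yaml
:yaml:
  :datadir: '/etc/puppet/hieradata'
:hierarchy:
  - "node/%{::fqdn}"
  - "datacenter/%{::datacenter}"
  - common
```

With a hash merge lookup, the apt::sources hashes found at the node and datacenter levels are merged, instead of the most specific one winning.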

You can find the source on github, and can download the module from the puppet forge. Installing it is as simple as:

puppet module install lomignet-apt

Puppet error messages and solutions

This is a collection of error messages I got while setting up a puppet infrastructure and testing modules, along with their reasons and solutions.

Failed to load library ‘msgpack’

Message

On an agent:

Debug: Failed to load library 'msgpack' for feature 'msgpack'
Debug: Puppet::Network::Format[msgpack]: feature msgpack is missing

On the master:

Debug: Puppet::Network::Format[msgpack]: feature msgpack is missing
Debug: file_metadata supports formats: pson b64_zlib_yaml yaml raw

Context

This happens when you run puppet agent, for instance.

Reason

msgpack is an efficient serialisation format. Puppet uses it (experimentally) when communicating between master and agent. The format requires a gem which, if not installed, triggers this debug message. It is completely harmless, it just pollutes your logs.

Fix

Just install the msgpack ruby gem. Depending on your system, you can run one of:

#debian based:
apt-get install ruby-msgpack
#generic
gem install msgpack

This immediately removes the debug messages. To actually use msgpack, you need to add the following line to the [main] or [agent] section of puppet.conf:

preferred_serialization_format = msgpack

Could not retrieve information from environment

Message

Error: /File[/var/lib/puppet/facts.d]: Could not evaluate: Could not retrieve information from environment $yourenvironment source(s) puppet://localhost/pluginfacts

Context

Puppet agent run.

Reason

If no module has a facts.d folder, puppet will throw this error. This is an actual bug in puppet, at least in version 3.7.3.

Fix

Option 1: Just ignore it. This is shown as an error, but it has no impact and the run carries on uninterrupted.

Option 2: Actually create a facts.d folder in one of your modules.
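If you go for option 2, creating the folder is a one-liner. The module path below is an example; point it at a module actually deployed on your master:

```shell
# Path is an example; use one of your own modules.
moduledir=/etc/puppet/modules/yourmodule
mkdir -p "$moduledir/facts.d"
```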

Could not find data item classes in any Hiera data file

Message

Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Could not find data item classes in any Hiera data file and no default supplied on node aws-nl-puppetmaster.dp.webpower.io

Context

Your puppet tests were all working fine from Vagrant. You just installed a puppet master, and the first agent run gives you this error.

Reason

Check your hiera.yaml file (in /etc/puppet/hiera.yaml, /etc/puppetlabs/puppet/hiera.yaml, or wherever hiera_config in your puppet.conf points). It has a :datadir section telling puppet where to find the hiera data. If the path there is absolute, it should point directly to the data directory. If it is relative, it is resolved against puppet's working directory, which is why it can work under Vagrant but break on the master.

Fix

Many options are possible.

  • Use a common absolute path everywhere.
  • Put the directory, maybe via a link, in its default location.
  • Puppet can interpolate variables when reading datadir, so if your issue is due to different environments, you could use a path like
    '/var/hieradata/%{::environment}/configuration'

Note that if you change hiera.yaml, you need to reload the puppet master as hiera.yaml is only read at startup.
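As a sketch, the relevant part of hiera.yaml could then look like this (an assumed Hiera 3 layout; adapt the path and hierarchy to your setup):

```yaml
:backends:
  - yaml
:yaml:
  # Absolute path, interpolating the environment as suggested above.
  :datadir: '/var/hieradata/%{::environment}/configuration'
:hierarchy:
  - "node/%{::fqdn}"
  - common
```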

No such file or directory @ dir_s_rmdir

Message

Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Failed when searching for node $nodename: No such file or directory @ dir_s_rmdir - /var/puppet/hiera/node/$nodename.yaml20150812-5415-1802nxn.lock

Context

Puppet agent run

Reason

  1. The puppet master tries to create a file in a directory it does not own, and thus has no permission.
  2. Puppet tries to create a file or directory whose parent does not exist.
  3. The partition where puppet tries to create a lock file is full.

Fix

  1. With the path given in the example error message:
    chown -R puppet:puppet /var/puppet/hiera
  2. Make sure the parent is created in the manifest as well.
  3. Free some space on, or grow, the full partition.
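For the missing-parent case, a minimal manifest sketch ensuring the whole parent chain exists could look like this (paths taken from the example message; owner and group assume your master runs as the puppet user):

```puppet
# Ensure every level of the path exists before anything writes below it.
file { ['/var/puppet', '/var/puppet/hiera', '/var/puppet/hiera/node']:
  ensure => directory,
  owner  => 'puppet',
  group  => 'puppet',
}
```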

This ‘if’ statement is not productive

Message

 This 'if' statement is not productive.

followed by some more explanation depending on the context.

Context

Puppet agent run

Reason

Puppet does not want to leave me alone, and pretends to know better than I do. I might want to have an if (false) {…} or if (condition) {empty block} for whatever reason, but no, puppet very violently and rudely bails out. There is a bug discussion about it, as well as a fix to change the wording, but the behaviour will stay.

Fix

Comment out what puppet does not like.

sslv3 alert certificate revoked or certificate verify failed

Message

SSL_connect returned=1 errno=0 state=SSLv3 read server session ticket A: sslv3 alert certificate revoked

or

Wrapped exception:
SSL_connect returned=1 errno=0 state=unknown state: certificate verify failed: [self signed certificate in certificate chain for /CN=Puppet CA: puppetmaster.example.com]

Context

Puppet agent run

Reason

You probably revoked or cleaned certificates on the puppet master, but did not inform the agent about it. Or maybe you are now pointing to a new puppet master.

Fix

You can fix this by cleaning the agent as well:

sudo rm -r /etc/puppet/ssl
sudo rm -r /var/lib/puppet/ssl

Puppet and virtual resources tutorial to manage user accounts

Virtual resources are a very powerful and not well understood feature of puppet. I will explain here what they are and why they are useful, using the management of users in puppet as an example.

By default, in puppet, a resource may be specified only once. The typical example of when this hurts is a user that needs to be created on, say, both the database and the web servers. This user can be defined only once, not once in the database class and once in the webserver class.

If you define this user as a virtual resource instead, you can declare them in multiple places without issue. The caveat is that, as the name suggests, this user is virtual only, and is not actually created on the server. Some extra work is needed to create (realise, in puppet-speak) the user.

Data structure and definitions

Jump to the next section if you want to go directly to the meat of the post. I still want to detail the data structure for better visualisation.

The full example can be found on github. The goal is to be able to define users with the following criteria and assumptions:

  • User definition is centralised in one place (typically common.yaml). A defined user on hiera does not mean that they are created on any server, it must be explicitly required.
  • A user might be ‘normal’ or have sudo rights. Sudo rights mean that they can do whatever they wish, passwordless. There is no finer granularity.
  • A user might be normal on a server, sudo on another one, absent on others. This can be defined anywhere in the hiera hierarchy.

As is good practice, all of this can be done via hiera. A user is defined like so, with simple basic properties:

accounts::config::users:
  name:
    # List of roles the user belongs to. Not necessarily matched to linux groups
    # They will be used in user::config::{normal,super} in node yaml files to
    # decide which users are present on a server, and which ones have sudo allowed.
    # Note that all users are part of the 'all' group
    roles: ['warrior', 'priest', 'orc']
    # default: bash
    shell: "/bin/zsh"
    # already hashed password.
    # https://thisdataguy.com/2014/06/10/understand-and-generate-unix-passwords
    # python -c 'import crypt; print crypt.crypt("passwerd", "$6$some_random_salt")'
    # empty/absent means no login via password allowed (other means possible)
    pass: '$6$pepper$P9Wt3.3Uqh9UZbvz5/6UPtHqa4KE/2aeyeXbKm0mpv36Z5aCBv0OQEZ1e.aKcPR6RBYvQIa/ToAfdUX6HjEOL1'
    # A PUBLIC rsa key.
    # Empty/absent means no key login allowed (other means possible)
    sshkey: 'a valid public ssh key string'

Roles here have no direct Linux counterpart, and have nothing to do with Linux groups. They are only an easy way to manage users inside hiera. You can for instance say that all system administrators belong to the role sysops, and grant sudo to the sysops role everywhere in one go.

Roles can be added at will, and are just a string tag. Role names will be used later to actually select and create users.

To then actually have users created on a server, roles must be added to 2 specific configuration arrays, depending on whether a role must have sudo rights or not. Note that all values added to these arrays are merged along the hierarchy, meaning that you can add users to specific servers in the node definition.

For instance, if in common.yaml we have:

accounts::config::sudo: ['sysadmin']
accounts::config::normal: ['data']

and in a specific node definition (say a mongo server) we have:

accounts::config::sudo: ['data']
accounts::config::normal: ['deployer']

– all sysadmin users will be everywhere, with sudo
– all data users will be everywhere, without sudo
– all data users will have the extra sudo rights on the mongo server
– all deployer users will be on the mongo server only, without sudo
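After hiera merges the arrays along the hierarchy, the effective values on the mongo server are:

```yaml
accounts::config::sudo: ['sysadmin', 'data']
accounts::config::normal: ['data', 'deployer']
```

Note that 'data' now appears in both arrays; handling that duplication is exactly what the defined() check described later is for.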

Very well, but to the point please!

So, why do we have a problem that cannot be resolved by usual resources?

  • I want the user definition to be done in one place (ie. one class) only
  • I would like to avoid manipulating data outside puppet (ie. not in a ruby library)
  • If a user ends up being both normal and sudo on a server, declaring them twice will not be possible

How does this work?

Look at the normal.pp manifest. Unfortunately, the sudo.pp manifest duplicates it almost exactly; the reason is ordering and the duplicated definition of the roles resource, but this is a detail.

Looking at the file, here are the interesting parts. First, accounts::normal::virtual:

class accounts::normal { 
  ...
  define virtual() {...}
  create_resources('@accounts::normal::virtual', $users)
  ...
}

This defines a virtual resource (note the @ in front of the resource name on the create_resources line), which is called for each and every element of $users. Note that as it is a virtual resource, users will not actually be created (yet).

The second parameter to create_resources() needs to be a hash. Keys will be resource titles, attributes will be resource parameters. Luckily, this is exactly how we defined users in hiera!

This resource actually does not do much: it just calls the actual user-creating resource, Accounts::Virtual. Accounts::Virtual is used as you would use any other puppet resource:

resource_name {title: attributes_key => attribute_value}

This is how the resource is realised. As said above, creating a virtual resource (virtual users in our case) does not automatically create the user. By calling it directly, the user is finally created:

accounts::virtual { $title:
  pass   => $pass,
  shell  => $shell,
  sshkey => $sshkey,
  sudo   => false,
}

Note the conditional statement just before:

unless defined (Accounts::Virtual[$title]) { ... }

In my design, there is no specific sudoer resource. The sudoer file is managed as part of the user resource. This means that if a user is found twice, once as normal and once as sudo, the same user resource could be declared twice. As the sudo users are managed before the normal users, we can check whether the user has already been defined. If that’s the case, the resource is not called a second time.

This is all and well, but how is the accounts::normal::virtual resource called? Via another resource, of course! This is what roles (accounts::normal::roles) does:

define roles($type) { ... }
create_resources('accounts::normal::roles', $normal)

Notice the difference in create_resources? There is no @ prefix in the resource name. This means that this resource is directly called with $normal as parameter, and is not virtual.

Note the $normal parameter. It is just some fudge to translate an array (the list of roles to create as normal users) into a hash, which is what create_resources() requires.
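To make the shape concrete: given accounts::config::normal: ['sysadmin', 'data'], the fudge produces something like the following hash (a sketch of the idea, not the module's exact code):

```puppet
# One key per role name; create_resources() will then declare
# accounts::normal::roles { 'sysadmin': type => 'normal' } and so on.
$normal = {
  'sysadmin' => { 'type' => 'normal' },
  'data'     => { 'type' => 'normal' },
}
```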

Inside accounts::normal::roles, we find the nicely named spaceship operator. Its role is to realise a bunch of resources, but only a subset of them: you can give it a filter parameter. In our case (forgetting the ‘all’ conditional, which is just fudging to handle a non-explicit group), you can see it used to filter on roles:

 Accounts::Normal::Virtual <| roles == $title |>

What this says is simply that we realise the resources Accounts::Normal::Virtual, but only for users having the value $title in their roles array.

To sum up, here is what happens, in pseudo code:

  • for each role as $role (done directly in a class)
    • for each user as $user (done in the role resource)
      • apply the resource virtual user (done in the virtual user resource)

Easy, no?

The real cost of puppet

I spent 2 months puppetising a set of servers that were already used in production, instead of doing it as I went, as it should have been done. After completing the puppetisation, I asked myself whether spending these 2 months actually added value to the business worth more than 2 months of my time. This is what I try to analyse here.

To avoid adding up apples and bananas, I am giving to each item an actual economic ($$$) value. Finding out this value is actually the greatest part of the fun.

  1. What am I computing?
    1. What are the benefits associated to puppet?
    2. What are the costs associated with puppetising?
    3. Equation – first go
    4. Generalisation and simplification
    5. Benefit: Disaster recovery
    6. Benefit: satisfaction
  2. Time is money!
    1. My daily cost: Wages and related
    2. My daily added value
    3. Putting everything together
  3. Conclusion

What am I computing?

What are the benefits associated to puppet?

  • Knowledge sharing: when I quit or am fired (which happened to me way more often than being hit by a bus or a meteorite), everything I did is documented in the form of puppet manifests.
  • Scaling out: when I need to add a new server to a cluster, this can be done automatically instead of involving manual work.
  • Disaster recovery: if a server goes down and needs to be rebuilt, if my puppet-fu is good enough I just need to press a button to get a fully functional server instead of spending days to rebuild it.
  • Ease of update: changing one puppet manifest and applying it everywhere is easier than updating a whole cluster of servers.
  • Satisfaction: most tasks are fun to do once, but boring when you need to do them over and over again.

After I finished this post, a coworker gave me some very interesting feedback: another huge benefit is the ability to create new workflows. Combined with OpenStack, for instance, it is very easy to spawn a test instance to try out an idea. If it took days, or even just hours instead of minutes, to get test servers ready, those ideas might never be tested. At this moment I am not sure how to measure that, so I will have to keep it out of the equation for now.

What are the costs associated with puppetising?

  • My wages and related expenses from my company.
  • Value I did not create while working on puppet.

Equation – first go

What I am trying to find out is if the benefits are greater than the cost. Mathematically, I am trying to find the sign of (benefits – costs). If this is positive, ie. greater than 0, I added value to the business. If not, maybe I should be fired.

benefits – costs =
(knowledge sharing + scaling out + disaster recovery + update + satisfaction) – (wages + value not created)

Generalisation and simplification

At the end of the day, most of those values can be matched to time spent doing something. I will thus start by simplifying the equation, reducing as much as possible to days, and will find the value of my days from there. Some values are one-offs (knowledge sharing, puppetisation), some are recurrent. I am looking at one year's worth of value, but I am adding a Y variable to be able to look further into the future. Maybe I did not add enough value for next year, but enough if we look at the next 2 or 3 years. Same again, it might be that our current number of servers does not make puppetising valuable, but a bigger number might make it worthwhile. That’s why I introduced the S variable as well, S being the number of servers I manage.

  • Total time spent: 2 months are about 40 working days (counting weekends, meetings, occasional days off and so on)
  • Knowledge sharing: I estimate that it would take about 6 days to understand the servers without documentation and to get the details mostly correct.
  • Scaling out can be summed up as the time I would spend installing a server manually (0.5 days on average, with one outlier taking 2 days) multiplied by the number of times I need to install one of those servers. This simplifies to about 0.5 * 0.1 * S * Y, ie. every year I increase our server farm by 10 percent. For instance, if I have 100 servers today, I expect to spend 5 days (10 servers @ 0.5 day per server) to scale out manually next year.
  • Update time is harder to compute, as Ansible does help a lot. It is probably not that relevant, but based on my experience puppet could save about 0.02 * S days a year. For instance, if I have 100 servers today, I expect to save 2 days with puppet compared to Ansible.
  • My wages + value I create on a daily basis can be reduced to time spent (days), and will be computed later.

This gives us the next line of our equation:

benefits – costs
= (knowledge sharing + scaling out + disaster recovery + update + satisfaction) – (wages + value not created)
= (6 days + 0.5 * 0.1 * S * Y days + disaster recovery + 0.02 * S * Y days + satisfaction) – 40 days

Benefit: Disaster recovery

The three main questions here are:

  1. If this bit of the system goes down for one day, how much does it cost the business?
  2. What is the probability that this bit of the system goes down?
  3. How long would it take to restore said system?

There is one very easy way to answer the first question: just take said system down, hide for a day, and look at the reports the following day. Somehow, my coworkers were not thrilled with the idea. Luckily (or not), we had such an occurrence a few weeks back. No production server was impacted, but some predictive models and recommendations were not updated for 24 hours. This was the perfect data for me! It so happens that getting daily revenues is actually not so trivial, but I got my answer ($moneyloss). This is actually a lower bound, as the servers being down might have a ripple effect over a few days.

The probability that a bit of the system goes down can be computed from the MTBF. I will use an average MTBF of 50000 hours here. This is a common value for servers, but it does not take into account, for instance, the rate at which data is written to disk. A disk receiving a lot of data before exporting it to Hadoop will have a much higher failure rate than a less used one. Same again, I assume that the failure rate is linear and do not care about the bathtub curve.

We already saw the cost of scaling out. I will estimate that disaster recovery costs twice this value, to account for the pressure, maybe finding more hardware, and so on.

What does that give us?

In the next year (Y), each server will fail (365*24)/50000 times, ie. about 0.18 times. Multiply that by the number of servers (S) and we are almost there. One server is actually still a single point of failure, and would create an outage if it went down. This gives us Y * (365*24)/50000 * S days + Y * (365*24)/50000 * $moneyloss:

benefits – costs
= (knowledge sharing + scaling out + disaster recovery + update + satisfaction) – (wages + value not created)
= (6 days + 0.5 * 0.1 * S * Y days+ disaster recovery + 0.02 * S * Y days + satisfaction) – 40 days
= (6 days + 0.5 * 0.1 * S * Y days + ( S * Y * (365*24)/50000 days + Y * (365*24)/50000 * $moneyloss ) + 0.02 * S * Y days + satisfaction) – 40 days

Benefit: satisfaction

Based on a paper by Alex Edmans from the London Business School, The Link Between Job Satisfaction and Firm Value, With Implications for Corporate Social Responsibility, a happy employee performs about 3% better. This was computed by comparing the growth of the best companies to work for versus the rest of the industry. Well, puppetising does not make me that happy, but I will assume that I was happy for those 2 months (1/6th of the year), so I performed 3% better for 1/6th of the year.

The equation becomes:

benefits – costs
= (knowledge sharing + scaling out + disaster recovery + update + satisfaction) – (wages + value not created)
= (6 days + 0.5 * 0.1 * S * Y days + disaster recovery + 0.02 * S * Y days + satisfaction) – 40 days
= (6 days + 0.5 * 0.1 * S * Y  days + ( S * Y * (365*24)/50000 days + Y * (365*24)/50000 * $moneyloss ) + 0.02 * S * Y days + satisfaction) – 40 days
= (6 days + 0.5 * 0.1 * S * Y days + ( S * Y * (365*24)/50000 days + Y * (365*24)/50000 * $moneyloss ) + 0.02 * S * Y days + 0.05 * 40 days) – 40 days

We now have almost all the values in. Simplifying gives us:

benefits – costs
= (0.3 * S * Y – 32) days + Y * 0.2 * $moneyloss

As expected, the result depends on how far we look into the future and how many servers I am managing.
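The simplified model is easy to check numerically. Here is a sketch in Python of the day-valued terms (note that the 0.3 coefficient above is the sum 0.05 + 8760/50000 + 0.02 ≈ 0.245, rounded up; the function below keeps the exact terms and leaves the $moneyloss term out, which only makes it a lower bound):

```python
def benefit_minus_cost_days(servers, years):
    """Day-valued terms of the benefits - costs model from the text."""
    knowledge_sharing = 6                                     # one-off
    scaling_out = 0.5 * 0.1 * servers * years                 # 0.5 day per server, 10% yearly growth
    disaster_recovery = servers * years * (365 * 24) / 50000  # failures per year from the MTBF
    updates = 0.02 * servers * years
    satisfaction = 0.05 * 40                                  # uplift over the 2 months
    puppetisation_cost = 40                                   # the 2 months spent
    return (knowledge_sharing + scaling_out + disaster_recovery
            + updates + satisfaction - puppetisation_cost)

# With 100 servers: negative after one year, positive from year two on.
print(round(benefit_minus_cost_days(100, 1), 2))  # -> -7.48
print(round(benefit_minus_cost_days(100, 2), 2))  # -> 17.04
```

This matches the conclusion below: at around 100 servers the first year is a net loss, and the balance turns positive from year two.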

Time is money!

My daily cost: Wages and related

This is the easiest bit, although even this computation is not straightforward. The cost of an employee cannot be measured by their wages alone. Think about hiring costs, space (rent) for the desk, secondary benefits (pension, training), management cost (if my team did not exist, my manager would be useless and let go), equipment (computer, phone) and so on. This of course depends on your company, your type of contract, your country of work or your type of work.

The cost of an employee is often described as their base wages multiplied by a factor. This factor is usually considered being between 1.5 and 2.7. I will use a value of 2.

My daily added value

You can argue that I put those 2 months in one go, instead of spreading them over time if I had puppetised everything when it should have been done. This is true, but what I am trying to work out is if it was worth puppetising in the current context, meaning with everything already up and running in production.

So, which value do I bring to my company on a monthly basis?

This is always a hard question to answer for a backend guy. I cannot claim to have closed a sale that brought in $zillions, or created a model which generated a 25% uplift in revenues. Conversely, the sales guy or the analyst would not have been able to do their job without me doing mine (or so I want to believe, at least). I thus went around asking various people for their opinion.

My manager answered me that I above all bring knowledge, and this is priceless. When I then argued that as I am priceless I should get a big raise, I sadly only got a resounding ‘no’. This makes me believe that I am, in fact, not priceless and that my manager does not believe it either. Other answers were that I am worth exactly what I negotiated as salary package, because this is how the CFO sees me: this dwh guy in the corner with a price tag on his forehead. Another answer was that from a pure money viewpoint, we should only use on-demand consultants, which we bring in only when we need them. None of those answers put a value on my worth, only on what I cost, so I carried on asking.

The best theoretical answer I got was from an analyst. Basically, I ‘just’ need to create a model where the main KPI ($€£) is put in relation to my start date (and ideally my quitting date). Other employees are confounding factors which must be handled by the model. This is not easy and would be a project of its own, for which I already know that the value generated would not exceed the cost incurred.

I eventually used $moneyloss (daily money lost if the servers I managed are down), and divided this by the number of people in the team, to give $myvalue.

Conversely, if I use the total revenue of my company divided by the total number of employees, I end up with a value 3 times as big. This means either that my team does not perform that well, or that we provide value in hidden ways, by helping other people make better decisions, for instance.

Putting everything together

benefits – costs
= (0.3 * S * Y – 32) days + Y * 0.2 * $moneyloss
= (0.3 * S * Y – 32) * (2 * $wages + $myvalue) + Y * 0.2 * $moneyloss

Now comes the exciting part. Putting all the numbers together for one year, I end up with…

benefits < costs

Ah. Maybe I should be fired… Luckily, this is for year one only. It levels out in year 2, and starts generating value after that. This holds across the whole plausible range for $myvalue.

Conclusion

I did try to base all these computations on actual data. There are of course some assumptions, some rounding that should probably not be there (MTBF), some data that is very hard to compute (my daily value), and some that would deserve better research (the cost of one day down). But the summary is that yes, puppet is valuable. My case was specific, as this was the first time I actually used puppet, so there was a big learning curve; if I had to do it again now, I would be a lot faster.

I also learned a lot about my value, my cost, the complexity of actually finding them out, and how the company’s revenues react to mistakes I make. This was a real eye opener. It is of course no secret that knowing these numbers helps in making better decisions. If I had them for every aspect of my company’s platform, prioritising would be a lot easier.

This was quite a fun post to write, which took way longer than expected as I discovered ramifications along the way. I can only encourage you to do the same, and think about what you would do if you found out that the value you bring is exactly €0/day? Exactly equal to your wages? 10 times your wages? €1 million/day?