The real cost of puppet

I spent 2 months puppetising a set of servers that were already in production, instead of doing it along the way as it should have been done. Once the puppetisation was complete, I asked myself whether those 2 months actually added more value to the business than 2 months of my time cost. This is what I try to analyse here.

To avoid adding up apples and bananas, I am giving each item an actual economic ($$$) value. Finding out this value is actually the greatest part of the fun.

  1. What am I computing?
    1. What are the benefits associated to puppet?
    2. What are the costs associated with puppetising?
    3. Equation – first go
    4. Generalisation and simplification
    5. Benefit: Disaster recovery
    6. Benefit: satisfaction
  2. Time is money!
    1. My daily cost: Wages and related
    2. My daily added value
    3. Putting everything together
  3. Conclusion

What am I computing?

What are the benefits associated to puppet?

  • Knowledge sharing: when I quit or am fired (which happened to me way more often than being hit by a bus or a meteorite), everything I did is documented in the form of puppet manifests.
  • Scaling out: when I need to add a new server to a cluster, this can be done automatically instead of involving manual work.
  • Disaster recovery: if a server goes down and needs to be rebuilt, if my puppet-fu is good enough I just need to press a button to get a fully functional server instead of spending days to rebuild it.
  • Ease of update: changing one puppet manifest and applying it everywhere is easier than updating a whole cluster of servers by hand.
  • Satisfaction: most tasks are fun to do once but boring when you need to do them over and over again.

After I finished this post, a coworker gave me some very interesting feedback. Another huge benefit is the ability to create new workflows. Combined with OpenStack for instance, it is very easy to spawn a test instance to try an idea out. If it took days or even only hours instead of minutes to have test servers ready, those ideas might never be tested. At this moment, I am not sure yet how to measure that, so I will have to keep it out of the equation for now.

What are the costs associated with puppetising?

  • My wages and related expenses from my company.
  • Value I did not create while working on puppet.

Equation – first go

What I am trying to find out is whether the benefits are greater than the costs. Mathematically, I am trying to find the sign of (benefits – costs). If this is positive, i.e. greater than 0, I added value to the business. If not, maybe I should be fired.

benefits – costs =
(knowledge sharing + scaling out + disaster recovery + update + satisfaction) – (wages + value not created)

Generalisation and simplification

At the end of the day, most of those values can be matched to time spent doing something. I will thus start by simplifying the equation by reducing as much as possible to days, and will find the value of my days from there. Some values are one-offs (knowledge sharing, puppetisation), some are recurrent. I am looking at one year's worth of value, but am adding the Y variable (number of years) to be able to see further into the future. Maybe I did not add enough value for next year, but enough if we look at the next 2 or 3 years. Likewise, it might be that our current number of servers does not make puppetising valuable, but a bigger number might make it worthwhile. That’s why I introduced the S variable as well, S being the number of servers I manage.

  • Total time spent: 2 months are about 40 working days (once weekends, meetings, the occasional day off and so on are taken out)
  • Knowledge sharing: I estimate that it would take about 6 days to understand the servers without documentation and to get the details mostly correct.
  • Scaling out can be summed up as the time I would spend installing a server manually (0.5 days on average, with one outlier taking 2 days) multiplied by the number of times I need to install one of those servers. This simplifies to about 0.5 * 0.1 * S * Y, i.e. every year I increase our server farm by 10 percent. For instance, if I have 100 servers today, I expect to spend 5 days (10 servers @ 0.5 day per server) scaling out manually next year.
  • Update time is harder to compute as ansible already helps a lot. It probably is not that relevant, but based on my experience puppet could save about 0.02 * S days a year. For instance, if I have 100 servers today, I expect to save 2 days a year with puppet compared to ansible.
  • My wages + value I create on a daily basis can be reduced to time spent (days), and will be computed later.

This gives us the next line of our equation:

benefits – costs
= (knowledge sharing + scaling out + disaster recovery + update + satisfaction) – (wages + value not created)
= (6 days + 0.5 * 0.1 * S * Y days + disaster recovery + 0.02 * S * Y days + satisfaction) – 40 days
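
To make the two time-based terms concrete, here is a minimal Python sketch. The function and parameter names are mine (they do not come from any tool), and S = 100, Y = 1 are simply the example figures used in the list above.

```python
def scaling_out_days(servers, years, days_per_install=0.5, growth_rate=0.1):
    """Days spent installing new servers by hand: 0.5 * 0.1 * S * Y."""
    return days_per_install * growth_rate * servers * years


def update_days(servers, years, days_saved_per_server=0.02):
    """Days saved on updates compared to ansible: 0.02 * S * Y."""
    return days_saved_per_server * servers * years


if __name__ == "__main__":
    s, y = 100, 1  # example values from the text, not real data
    print(f"scaling out: {scaling_out_days(s, y):.1f} days")  # 5.0 days
    print(f"updates:     {update_days(s, y):.1f} days")       # 2.0 days
```

Keeping the install time and growth rate as parameters makes it easy to see how sensitive the result is to those two guesses.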

Benefit: Disaster recovery

The three main questions here are:

  1. If this bit of the system goes down for one day, how much would it cost to the business?
  2. What is the probability that this bit of the system goes down?
  3. How long would it take to restore said system?

There is one very easy way to answer the first question. Just take said system down, hide for a day, and look at the reports the following day. Somehow, my coworkers were not thrilled with the idea. Luckily (or not), we had such an occurrence a few weeks back. No production server was impacted, but some predictive models and recommendations were not updated for 24 hours. This was the perfect data for me! It so happens that getting daily revenues is actually not so trivial, but I got my answer ($moneyloss). This is actually a lower bound, as the outage might have had ripple effects over a few days.

The probability that a bit of the system goes down can be computed from the MTBF (mean time between failures). I will use an average MTBF of 50000 hours here. This is a common value for servers, but it does not take into account the rate at which data is written to disk, for instance. A disk receiving a lot of data before exporting it to Hadoop would have a much higher failure rate than a less used disk. Likewise, I assume that the failure rate is constant over time and ignore the bathtub curve.

We already saw the cost of scaling out. I will estimate that disaster recovery costs twice this value, to account for the pressure, possibly having to find more hardware, and so on.

What does that give us?

In the next year (Y), each server will fail (365*24)/50000 times. Multiply that by the number of servers (S), at one day per rebuild (twice the 0.5 days of a manual install), and we are almost there. One server is actually still a single point of failure, and would create an outage if it went down. This gives us: S * Y * (365*24)/50000 days + Y * (365*24)/50000 * $moneyloss:

benefits – costs
= (knowledge sharing + scaling out + disaster recovery + update + satisfaction) – (wages + value not created)
= (6 days + 0.5 * 0.1 * S * Y days + disaster recovery + 0.02 * S * Y days + satisfaction) – 40 days
= (6 days + 0.5 * 0.1 * S * Y days + ( S * Y * (365*24)/50000 days + Y * (365*24)/50000 * $moneyloss ) + 0.02 * S * Y days + satisfaction) – 40 days
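
The failure-rate arithmetic can be sketched in a few lines of Python. The names are hypothetical, the 1 day per rebuild is the "twice the install time" estimate, and a constant failure rate is assumed as stated above.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours
MTBF_HOURS = 50_000        # average MTBF assumed in the text


def failures_per_server_per_year(mtbf_hours=MTBF_HOURS):
    """Expected failures of one server in one year, at a constant rate."""
    return HOURS_PER_YEAR / mtbf_hours


def recovery_days(servers, years, days_per_rebuild=1.0):
    """Days spent rebuilding failed servers: S * Y * (365*24)/50000."""
    return servers * years * failures_per_server_per_year() * days_per_rebuild


if __name__ == "__main__":
    print(f"{failures_per_server_per_year():.4f} failures per server per year")  # 0.1752
    print(f"{recovery_days(100, 1):.2f} rebuild days for 100 servers")           # 17.52
```

The same 0.1752 factor multiplies $moneyloss for the single point of failure, since only that one server causes an outage when it goes down.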

Benefit: satisfaction

Based on a paper by Alex Edmans from the London Business School, The Link Between Job Satisfaction and Firm Value, With Implications for Corporate Social Responsibility, a happy employee would perform about 3% better. This has been computed by comparing the growth of the best companies to work for versus the rest of the industry. Well, puppetising does not make me that happy, but I will assume that I was happy for those 2 months (1/6th of the year), so I got 1/6th of that uplift. Strictly, 3% over 1/6th of the year is 0.5%; the equation below uses a more generous factor of 0.05 on the 40 days, i.e. 2 days.

The equation becomes:

benefits – costs
= (knowledge sharing + scaling out + disaster recovery + update + satisfaction) – (wages + value not created)
= (6 days + 0.5 * 0.1 * S * Y days + disaster recovery + 0.02 * S * Y days + satisfaction) – 40 days
= (6 days + 0.5 * 0.1 * S * Y days + ( S * Y * (365*24)/50000 days + Y * (365*24)/50000 * $moneyloss ) + 0.02 * S * Y days + satisfaction) – 40 days
= (6 days + 0.5 * 0.1 * S * Y days + ( S * Y * (365*24)/50000 days + Y * (365*24)/50000 * $moneyloss ) + 0.02 * S * Y days + 0.05 * 40 days) – 40 days

We now have almost all values in. By simplifying it, it gives us:

benefits – costs
= (0.3 * S * Y – 32) days + Y * 0.2 * $moneyloss
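
For anyone who wants to check the simplification, here is a small Python sketch that collects the coefficients of the long equation. The function name is mine. The unrounded per-server coefficient comes out at about 0.245 and the outage factor at about 0.175; the simplified line rounds these to 0.3 and 0.2.

```python
def simplified_coefficients():
    """Collect the terms of the long equation into three numbers."""
    install = 0.5 * 0.1             # scaling out, days per server per year
    failure = (365 * 24) / 50000    # rebuilds, days per server per year
    update = 0.02                   # updates, days per server per year
    per_server_year = install + failure + update

    knowledge_sharing = 6           # one-off, days
    satisfaction = 0.05 * 40        # one-off, days
    puppetising = 40                # one-off cost, days
    fixed_days = knowledge_sharing + satisfaction - puppetising

    outage_factor = (365 * 24) / 50000  # multiplies Y * $moneyloss
    return per_server_year, fixed_days, outage_factor


if __name__ == "__main__":
    per_server_year, fixed_days, outage_factor = simplified_coefficients()
    print(f"per server per year: {per_server_year:.4f} days")  # 0.2452
    print(f"fixed part:          {fixed_days:.1f} days")        # -32.0
    print(f"outage factor:       {outage_factor:.4f}")          # 0.1752
```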

As expected, the result will depend on how far we look into the future and how many servers I am managing.

Time is money!

My daily cost: Wages and related

This is the easiest bit although even this computation is not straightforward. The cost of an employee cannot be measured only by their wages. Think about hiring costs, space (rent) for the desk, secondary benefits (pension, training), management cost (if my team did not exist, my manager would be useless and let go), equipment (computer, phone) and so on. This of course depends on your company, your type of contract, your country of work or your type of work.

The cost of an employee is often described as their base wages multiplied by a factor. This factor is usually considered to be between 1.5 and 2.7. I will use a value of 2.

My daily added value

You can argue that I put those 2 months in one go, instead of spreading them over time if I had puppetised everything when it should have been done. This is true, but what I am trying to work out is if it was worth puppetising in the current context, meaning with everything already up and running in production.

So, which value do I bring to my company on a monthly basis?

This is always a hard question to answer for a backend guy. I cannot claim to have closed a sale that brought in $zillions, or created a model which generated a 25% uplift in revenues. Conversely, the sales guy or the analyst would not have been able to do their job without me doing mine (or so I want to believe at least). I thus went around to ask various people for their opinion.

My manager answered me that I above all bring knowledge, and this is priceless. When I then argued that as I am priceless I should get a big raise, I sadly only got a resounding ‘no’. This makes me believe that I am, in fact, not priceless and that my manager does not believe it either. Other answers were that I am worth exactly what I negotiated as salary package, because this is how the CFO sees me: this dwh guy in the corner with a price tag on his forehead. Another answer was that from a pure money viewpoint, we should only use on-demand consultants, which we bring in only when we need them. None of those answers put a value on my worth, only on what I cost, so I carried on asking.

The best theoretical answer I got was from an analyst. Basically, I ‘just’ need to create a model where the main KPI ($€£) is put in relation with my start date (and ideally my quitting date). Other employees are confounding factors which must be handled by the model. This is not easy and would be a project of its own, for which I already know that the value generated would not exceed the cost incurred.

I eventually used $moneyloss (daily money lost if the servers I managed are down), and divided this by the number of people in the team, to give $myvalue.

By contrast, if I use the total revenue of my company divided by the total number of employees, I end up with a value 3 times as big. This means either that my team does not perform that well, or that we provide value in hidden ways, by helping other people make better decisions, for instance.

Putting everything together

benefits – costs
= (0.3 * S * Y – 32) days + Y * 0.2 * $moneyloss
= (0.3 * S * Y – 32) * (2 * $wages + $myvalue) + Y * 0.2 * $moneyloss
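
The final formula is easy to play with as a function. This is only a sketch: the function name is mine, and every number in the usage example is a made-up placeholder, not my actual wages, value or revenue figures.

```python
def net_value(servers, years, daily_wages, daily_value, money_loss):
    """benefits - costs, in money, following the simplified equation.

    daily_wages -- base wages per day ($wages); the factor 2 is the cost multiplier
    daily_value -- value created per day ($myvalue)
    money_loss  -- money lost per day of outage ($moneyloss)
    """
    days = 0.3 * servers * years - 32
    return days * (2 * daily_wages + daily_value) + years * 0.2 * money_loss


if __name__ == "__main__":
    # Hypothetical inputs, for illustration only.
    for years in (1, 2, 3):
        result = net_value(servers=50, years=years,
                           daily_wages=200, daily_value=300, money_loss=3000)
        print(f"after {years} year(s): {result:+.0f}")
```

Looping over Y like this is how I looked further than the first year: the one-off 32 days weigh heavily at first, and the recurrent terms catch up later.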

Now is the exciting thing. Putting all the numbers together for one year, I end up with…

benefits < costs

Ah. Maybe I should be fired… Luckily, this is for year one only. It breaks even in year 2, and starts generating value after that. The picture is very similar across the whole plausible range of $myvalue.

Conclusion

I did try to base all these computations on actual data. There are of course some assumptions, some rounding that should probably not be there (MTBF), some data that are very hard to compute (my daily value), and some that would deserve better research (cost of one day down). But the summary is that yes, puppet is valuable. Mine was a specific case, as this was the first time I actually used puppet, so there was a big learning curve; if I had to do it again now I would be a lot faster.

I also learned a lot about my value, my cost, how complex it is to actually find them out, and how the company’s revenues react to mistakes I make. This was a real eye opener. It is of course no secret that knowing these numbers helps make better decisions. If I had those numbers for every aspect of my company’s platform, prioritising would be a lot easier.

This was quite a fun post to write, which took way longer than expected as I discovered ramifications along the way. I can only encourage you to do the same, and to think about what you would do if you found out that the value you bring is exactly 0€/day. Exactly equal to your wages? 10 times your wages? 1 million€/day?
