Replacing a single mongoDB server

I am moving a single mongoDB server to another hardware, and I want to do that with the least possible production interruption, of course.

Well, it so happens that it is not possible if you did not plan it from the start. You can argue that if I have a single SPOF server in production I am doing my job badly, but this is beside the point for this post.

MongoDB has this neat replication features, where you can build a cluster of servers, with one primary and a few slaves (secondaries), among other options. If you properly configured mongo to use this feature, then you can add a secondary, promote it to primary to eventually switch off the initial primary. This is what I will describe here.

Note that there will be 2 (very short) downtimes. One to create a replica set (this is just a restart), and one where the primaries are switched (you need to redirect connections to your new primary).

A note about vagrant

If you are using vagrant, make sure that you use the plugin vagrant-hostmanager (vagrant plugin install vagrant-hostmanager) which helps managing /etc/hosts from inside vagrant boxes. Furthermore, make sure you set a different hostname to each of your VMs. By default, if you use the same basebox, they will probably end up having the same hostname (config.vm.hostname in your vagrant file, or the more specific version if you define a cluster inside your vagrantfile).

Configure the replication set

First of all, you need to tell mongoDB to use the replication feature. If not you will end up with messages like:

> rs.initiate()
{ "ok" : 0, "errmsg" : "server is not running with --replSet" }

You just need to update your /etc/mongodb.conf to add a line like so:

replSet=spof

This is the config option that enables the replication. All servers in your replica set will have the same option, with the same value.

On a side note, in the same file make sure you are not binding only to 127.0.0.1, or you will have trouble having your 2 mongo instances talking to each other.

The sad thing is that mongo cannot reload its config file:

root@debian-800-jessie:~# service mongodb reload
[warn] Reloading mongodb daemon: not implemented, as the daemon ... (warning).
[warn] cannot re-read the config file (use restart). ... (warning).

Right. So a restart is needed:

service mongodb restart

You can now connect to your mongo shell the usual way, and initialise the replica set. This follows part of the tutorial explaining how to convert a standalone server to a replica set. Just type in the mongo shell:

rs.initiate()

and the (1 machine) replica set is now operational.

You can check this easily:

> rs.conf()
{
  "_id" : "spof",
  "version" : 1,
  "members" : [
  {
  "_id" : 0,
  "host" : "debian-800-jessie:27017"
 }
 ]
}

On your (new with an empty mongo) server, make sure that you add the replSet line as well in mongodb.conf.

Note that the hostname must be resolvable on the other machine of the cluster. If mongoDB somehow picked the local hostname, your replica set will just not work. If the local hostname has been picked up, see option 2 (reconfig) below.

You are now ready to add the second server to the set.

spof:PRIMARY> rs.add("server2.example.com:27017")
{ "ok" : 1 }

We can check that all is fine:

spof:PRIMARY> rs.conf()
{
   "_id" : "spof",
   "version" : 3,
   "members" : [
   {
       "_id" : 0,
       "host" : "debian-800-jessie:27017"
   },
   {
       "_id" : 1,
       "host" : "server2.example.com:27017"
   },
   ]
}

If the local hostname was chosen, the easiest option is to fully reconfigure the replica set from within the mongo shell, based on your current configuration :

// Get current configuration object
cfg=rs.conf()
// update the current machine to use the non local name
cfg.members[0].host="server1.example.com"
// fully add server 2
cfg.members[1]={"_id":1, host:"server2.example.com"}
// use this new config
rs.reconfig(cfg)

Ok, we are all good, and the replica set is properly set up. What happened on our server2 which was empty when we started?

on server2:

spof:SECONDARY> use events
spof:SECONDARY> db.events.find()
error: { "$err" : "not master and slaveOk=false", "code" : 13435 }

Hum, what does that mean? In short, there is a replication delay between the primary and secondary, so by default mongo disables reads to the secondary to make sure you always read up to date data. You can read more about read preferences, but to tell mongo that yes, you know what you are doing, just issue the slaveOK() command:

spof:SECONDARY> rs.slaveOk()
spof:SECONDARY> db.events.find()
{ "_id" : ObjectId("556d550b59a5fb8615044c72"), "name" : "relevant" }

Success! (In this vagrant example, there was only one document in the collection).

In real life, if the secondary needs to sync a lot of data, it will stay in state STARTUP2 for a long time, which you can see via rs.status(). In the log files of the new secondary, you can see progress per collections. It will then move to RECOVERING to finally become SECONDARY, which is when it will start accepting connections.

Switch primaries

We are all set, you waited long enough to have the secondary in sync with the primary. What now? We first need to switch primary and secondary roles. This can be done easily by changing the priorities:

spof:PRIMARY> cfg=rs.conf()
spof:PRIMARY> cfg.members[0].priority=0.5
0.5
spof:PRIMARY> cfg.members[1].priority=1
1
spof:PRIMARY> rs.reconfig(cfg)
spof:SECONDARY>

As you can see, your prompt changed from primary to secondary.

From this moment on, all connections to your now secondary should succeed but you will not be able to do much (secondary cannot write, and remember slaveOk()). You must thus be sure that your client connect to the new primary, or that you know that the connection is readonly in which case you can use slaveOk(). This switchover will be your last downtime.

Clean up

you can tell your new master that the secondary is not needed anymore:

rs.remove('k1.wp:27017')

Note that if you switch the secondary off (service mongodb stop), then the primary will step down to secondary as well, as it cannot guarantee that it is in a coherent state. This is what you get from using a replica set with only 2 machines.

You can now dispose of your old primary as you wish.

If you want to play around with your old primary, you will be out of luck to start with:

"not master or secondary; cannot currently read from this replSet member"

It will of course be obvious that you need to remove the replSet value from mongodb.conf and restart the server. Sadly, you will then be greeted by another, longer message when you connect:

Server has startup warnings: 
Wed Jun 3 13:10:44.435 [initandlisten] 
Wed Jun 3 13:10:44.435 [initandlisten] ** WARNING: mongod started without --replSet yet 1 documents are present in local.system.replset
Wed Jun 3 13:10:44.435 [initandlisten] ** Restart with --replSet unless you are doing maintenance and no other clients are connected.
Wed Jun 3 13:10:44.435 [initandlisten] ** The TTL collection monitor will not start because of this.
Wed Jun 3 13:10:44.435 [initandlisten] ** For more info see http://dochub.mongodb.org/core/ttlcollections
Wed Jun 3 13:10:44.435 [initandlisten]

Well, the solution is almost obvious from the error message. If there is a document in local.system.replset, let’s just remove it!

> use local
switched to db local
> db.system.replset.find()
{ "_id" : "spof", "version" : 4, "members" : [ { "_id" : 1, "host" : "server1.example.com:27017" } ] }
> db.system.replset.remove()
> db.system.replset.find()
>

Once you exit and reconnect to mongoDB, all will be fine, and will have your nice standalone server back.

Advertisements

Disco: replicate all data away from a blacklisted node

Before removing a node from Disco, the right way to do it is to blacklist it, replicate away its data, and finally remove it form the cluster. Replicating data away is done by running the garbage collector. Unfortunately, the garbage collector does not migrate everything in one go, so a few runs are needed. To not have to do this manually, the following script will run the garbage collector as often as needed as long as some nodes are blacklisted but not yet safe for removal. The full script can be found on github.

#!/usr/bin/env bash

# Will check if there are nodes blacklisted for ddfs but not fully replicated yet.
# If this is the case, it will run the GC as long as all data is not replicated away.

# debug
#set -x

# Treat unset variables as an error when substituting.
set -u

# master
HOST=disco.example.com:8989
DDFS_URL="http://$HOST/ddfs/ctrl"
DISCO_URL="http://$HOST/disco/ctrl"

# API commands
GC_STATUS=$DDFS_URL/gc_status
GC_START=$DDFS_URL/gc_start
BLACKLIST=$DISCO_URL/get_gc_blacklist
SAFE_GC=$DDFS_URL/safe_gc_blacklist

# counter to mark how many times the GC ran.
CNT=0

function is_running {
    # will get "" if GC not running, or a string describing the current status.
    _GC_RES=$(wget -q -O- $GC_STATUS)
    if [ "$_GC_RES" == '""' ]
    then
        _GC_RES=''
    fi
    echo $_GC_RES
}

function is_safe {
    _BLACKLISTED=$(wget -q -O- $BLACKLIST)
    _SAFE=$(wget -q -O- $SAFE_GC)

    # eg.
    # blacklisted:  ["slave1","slave2","slave3"]
    # safe_gc_blacklist: []

    # safe is a subset of get. If we concat the 2 (de-jsonised) and get uniques, we have 2 cases:
    # - no uniques => all nodes are safe (in blacklist *and* in safe)
    # - uniques => some nodes are not safe

    echo "$_BLACKLISTED $_SAFE" | tr -d '[]"' | tr ', ' '\n' | sort | uniq -u
}

while true
do

    GC_RES=$(is_running)

    if [ -z "$GC_RES" ]
    then
        echo "GC not running, let's check if it is needed."
        NON_SAFE=$(is_safe)
        if [ -z "$NON_SAFE" ]
        then
            echo "All nodes are safe for removal."
            exit
        else
            echo "Somes nodes are not yet safe: $NON_SAFE"
            CNT=$((CNT+1))
            date +'%Y-%m-%d %H:%M:%S'
            wget -q -O /dev/null $GC_START
            echo "Run $CNT started."
        fi
    else
        echo "GC running ($GC_RES). Let's wait".
    fi
    sleep 60
done