Disco: purge all completed jobs

In Disco, jobs that have not been purged keep using disk space for their temporary data. This can eventually fill all the disk space in your cluster. I have been there, and it is not fun, I promise you.

You can set up a purge policy (a job might purge itself after completion, for instance), but if you need to quickly clean up all jobs, this snippet will help. It purges all completed jobs, successful or not.

for job in $(disco jobs)
do
  # A job counts as completed if its events mention a kill or the READY state.
  if disco events "$job" | grep "WARN: Job killed\|READY"
  then
    disco purge "$job"
  fi
done
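To try the "is this job completed?" test without touching a cluster, the check can be pulled out into a small function that reads event text on standard input. This is only a sketch: `job_is_finished` is a hypothetical name, and the sample event lines are made up for illustration. Only the two patterns come from the snippet above.

```shell
# Hypothetical helper: succeeds if the event text on stdin shows a finished job.
# The two patterns are the ones matched in the purge loop.
job_is_finished() {
    grep -q -e 'WARN: Job killed' -e 'READY'
}

# Made-up event lines, for illustration only.
printf '%s\n' 'map phase started' 'READY: results available' | job_is_finished \
    && echo "finished"
printf '%s\n' 'map phase started' | job_is_finished \
    || echo "still running"
```

In the loop, this would be used as `if disco events "$job" | job_is_finished`, which also keeps the matching event lines out of the terminal output.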

Disco: replicate all data away from a blacklisted node

Before removing a node from Disco, the right way to do it is to blacklist it, replicate its data away, and finally remove it from the cluster. Replicating data away is done by running the garbage collector. Unfortunately, the garbage collector does not migrate everything in one go, so a few runs are needed. To avoid doing this manually, the following script runs the garbage collector as many times as needed, for as long as some nodes are blacklisted but not yet safe for removal. The full script can be found on GitHub.

#!/usr/bin/env bash

# Checks whether some nodes are blacklisted for DDFS but not fully replicated yet.
# If so, runs the GC repeatedly until all their data has been replicated away.

# debug
#set -x

# Treat unset variables as an error when substituting.
set -u

# master
HOST=disco.example.com:8989
DDFS_URL="http://$HOST/ddfs/ctrl"
DISCO_URL="http://$HOST/disco/ctrl"

# API commands
GC_STATUS=$DDFS_URL/gc_status
GC_START=$DDFS_URL/gc_start
BLACKLIST=$DISCO_URL/get_gc_blacklist
SAFE_GC=$DDFS_URL/safe_gc_blacklist

# counter to mark how many times the GC ran.
CNT=0

function is_running {
    # The API returns the literal string "" if the GC is not running,
    # or a string describing the current status otherwise.
    _GC_RES=$(wget -q -O- "$GC_STATUS")
    if [ "$_GC_RES" = '""' ]
    then
        _GC_RES=''
    fi
    echo "$_GC_RES"
}

function is_safe {
    _BLACKLISTED=$(wget -q -O- "$BLACKLIST")
    _SAFE=$(wget -q -O- "$SAFE_GC")

    # eg.
    # blacklisted:  ["slave1","slave2","slave3"]
    # safe_gc_blacklist: []

    # safe is a subset of blacklisted. If we concatenate the two (de-jsonised) lists
    # and keep only the entries that appear once, we have 2 cases:
    # - no unique entries => all nodes are safe (in blacklisted *and* in safe)
    # - unique entries => those nodes are not safe yet

    echo "$_BLACKLISTED $_SAFE" | tr -d '[]"' | tr ', ' '\n' | sort | uniq -u
}

while true
do

    GC_RES=$(is_running)

    if [ -z "$GC_RES" ]
    then
        echo "GC not running, let's check if it is needed."
        NON_SAFE=$(is_safe)
        if [ -z "$NON_SAFE" ]
        then
            echo "All nodes are safe for removal."
            exit
        else
            echo "Some nodes are not yet safe: $NON_SAFE"
            CNT=$((CNT+1))
            date +'%Y-%m-%d %H:%M:%S'
            wget -q -O /dev/null "$GC_START"
            echo "Run $CNT started."
        fi
    else
        echo "GC running ($GC_RES). Let's wait."
    fi
    sleep 60
done
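The `sort | uniq -u` trick in `is_safe` is easier to follow with fixed input in place of the two `wget` calls. The sketch below assumes the API returns flat JSON arrays of node names, as in the example in the script comments. `not_yet_safe` is a hypothetical name, and the extra `sed` step, which drops blank lines, is an addition that is not in the original script.

```shell
# Hypothetical helper: print the node names that appear in only one of two lists.
# $1: blacklisted nodes, $2: nodes already safe for removal (a subset of $1).
not_yet_safe() {
    echo "$1 $2" |
        tr -d '[]"' |      # strip the JSON brackets and quotes
        tr ', ' '\n\n' |   # one name per line
        sort |
        uniq -u |          # keep names seen exactly once, i.e. not yet safe
        sed '/^$/d'        # drop blank lines left over by the separators
}

# slave2 is in both lists, so only slave1 and slave3 are printed.
not_yet_safe '["slave1","slave2","slave3"]' '["slave2"]'
```

Because every safe node also appears in the blacklist, it shows up twice after concatenation and `uniq -u` discards it. Whatever remains is exactly the set of nodes that still hold data.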