Just a quick reminder on terminology before we dive in:
- A host is a server on which Vertica is set up, but not necessarily used by a database. You can add hosts to a cluster to make them available to a database.
- A node is a host that is part of a database.
If one of your nodes goes down while the server itself is still up (a data disk failure, for example), the Vertica documentation says you can replace it with another node that has a different IP address. I have never run into this case, so I will trust the documentation on that.
I did have to deal with a dead host, though, and in that case the documentation is not enough. As part of replacing a node, you need to add a new host to the cluster. When you do this via the update_vertica utility, Vertica checks the connection between all hosts of the cluster. Since one host is down, the installation fails.
The solution is not trivial, but it is quite straightforward once you know the steps. The goal of this post is to explain it step by step:
- System preparation
- Update existing node info in the catalog via vsql
- Update admintools.conf
- Install Vertica on the new server
- Configure new node
- Restart new node
Make sure that Vertica is not installed and that /opt/vertica does not exist:
yum remove vertica
rm -rf /opt/vertica
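Then update the failed node's address in the catalog via vsql. The statements below use two placeholders:
- <failed_node_name>: the name of the node you want to replace, taken from the node_name column in v_catalog.nodes
- <newip>: IP address of the replacement host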
-- change the node IP
ALTER NODE <failed_node_name> HOSTNAME '<newip>';
-- change the node's spread/control IP
ALTER NODE <failed_node_name> CONTROL HOSTNAME '<newip>';
-- re-write spread.conf with the new IP address and reload the running config
-- (db should remain UP)
SELECT RELOAD_SPREAD(true);
This must be done on an UP node. Any UP node will do; in this post we will call it <source_host>.
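To avoid typos when filling in the placeholders, the statements can be generated from the two values. This is only a sketch: node_readdress_sql is a hypothetical helper name (not a Vertica tool), and the node name and IP are the example values from this post. It only prints SQL and executes nothing.

```shell
#!/bin/sh
# Hypothetical helper (not part of Vertica): prints the statements from the
# step above with the placeholders filled in. It does not execute anything.
#   $1 = failed node name, as found in the node_name column of v_catalog.nodes
#   $2 = IP address of the replacement host
node_readdress_sql() {
  cat <<EOF
ALTER NODE $1 HOSTNAME '$2';
ALTER NODE $1 CONTROL HOSTNAME '$2';
SELECT RELOAD_SPREAD(true);
EOF
}

# Example values from this post; review the output before running it.
node_readdress_sql v_spil_dwh_node0002 10.0.0.42
```

Once the output looks right, it can be pasted into a vsql session on <source_host>, or piped into vsql, assuming vsql is on your PATH and connects with your usual defaults.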
On <source_host>, edit the file /opt/vertica/config/admintools.conf and replace every instance of the old IP address with the new one. There are 3 lines to update:
- [Cluster] > hosts
- [Nodes] : 2 lines, the one starting with the node name and the one with the node number.
For instance, assume we are replacing node2 of a 3-node cluster, changing its IP from 10.0.0.2 to 10.0.0.42.
Before the change, the relevant lines of admintools.conf were:
[Cluster]
hosts = 10.0.0.1,10.0.0.2,10.0.0.3
[Nodes]
node0002 = 10.0.0.2,/home/dbadmin,/home/dbadmin
v_spil_dwh_node0002 = 10.0.0.2,/home/dbadmin,/home/dbadmin
After the change, the new IP appears on all three lines:
[Cluster]
hosts = 10.0.0.1,10.0.0.42,10.0.0.3
[Nodes]
node0002 = 10.0.0.42,/home/dbadmin,/home/dbadmin
v_spil_dwh_node0002 = 10.0.0.42,/home/dbadmin,/home/dbadmin
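If you would rather not edit the file by hand, the replacement can be scripted. The following is a minimal sketch, not an official procedure: replace_ip is a hypothetical helper name, and it assumes a sed that supports -E (GNU and BSD sed both do). It keeps a .bak copy of the file before changing it.

```shell
#!/bin/sh
# Hypothetical helper: replace OLD_IP with NEW_IP in FILE, keeping FILE.bak.
#   $1 = file to edit, $2 = old IP, $3 = new IP
replace_ip() {
  # Escape the dots so that they only match literal dots.
  old_re="$(printf '%s' "$2" | sed 's/\./\\./g')"
  cp -p "$1" "$1.bak" || return 1
  # Only match the IP when it is not surrounded by digits or dots, so that
  # replacing 10.0.0.2 leaves 10.0.0.20 alone. In the replacement, \1 and \2
  # put back the characters found on either side of the IP.
  sed -E "s/(^|[^0-9.])${old_re}([^0-9.]|\$)/\\1$3\\2/g" "$1.bak" > "$1"
}

# Example with the values from this post (run on the source host):
#   replace_ip /opt/vertica/config/admintools.conf 10.0.0.2 10.0.0.42
```

Compare the file with its .bak copy afterwards and confirm that exactly the three expected lines changed.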
On <source_host>, the same host used in the previous step, use update_vertica to add the new host to the cluster. The mandatory options are --rpm and the options you used at install time (which you can find in /opt/vertica/config/admintools.conf), or the path to your config file (--config-file/-z) if you used one.
Do NOT use the -A/--add-hosts or -R/--remove-hosts switches. You will most likely need -u/--dba-user, and maybe a few more.
sudo /opt/vertica/sbin/update_vertica --rpm <complete path of RPM> -u <user> -g <group> ...
This script will verify all hosts and install the RPM on the new one.
Log in as dbadmin (or whichever user is your database administrator) on the new node, and recreate the base and data directories as they were on the failed node. Assuming that:
- the failed node was node2,
- your database is named $dwh,
- your base directory is /home/dbadmin,
Then create the following:
mkdir /home/dbadmin/${dwh}
mkdir /home/dbadmin/${dwh}/v_${dwh}_node0002_data
mkdir /home/dbadmin/${dwh}/v_${dwh}_node0002_catalog
You can look at any UP node for an example of the directory hierarchy.
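The same layout can be created with a small function, which makes the variable handling explicit. This is a sketch under assumptions: recreate_node_dirs is a hypothetical helper name, and it uses mkdir -p (which also creates the parent database directory) where the commands above use plain mkdir.

```shell
#!/bin/sh
# Hypothetical helper: recreate the directory layout of a failed node.
#   $1 = base directory, $2 = database name, $3 = node number (e.g. 0002)
recreate_node_dirs() {
  base="$1"; db="$2"; node="$3"
  # Braces are required here: $db_node would be read as a variable
  # named db_node, which is not set.
  mkdir -p "${base}/${db}/v_${db}_node${node}_data" \
           "${base}/${db}/v_${db}_node${node}_catalog"
}

# Example, run as dbadmin on the new node (dwh stands for your database name):
#   recreate_node_dirs /home/dbadmin dwh 0002
```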
Still on <source_host>, the node where you edited the config file, distribute the config files via admintools:
- run /opt/vertica/bin/admintools as the dbadmin user
- go to Configuration Menu > Distribute Config Files
- select Database Configuration and Admintools Meta-Data.
If you cannot find spread.conf under /home/dbadmin/${dwh}/v_${dwh}_node0002_catalog, copy it from <source_host>. You can then check that spread.conf now has the IP 10.0.0.42 instead of 10.0.0.2.
Finally, as a last sanity check, have a look at /opt/vertica/config/admintools.conf and make sure that the new IP appears instead of the old one.
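These sanity checks can also be scripted. The sketch below uses a hypothetical helper, old_ip_gone, which succeeds only when the old IP no longer appears in a given file. It is an illustration, not a replacement for reading the files yourself.

```shell
#!/bin/sh
# Hypothetical helper: succeeds (exit 0) when OLD_IP no longer appears in FILE.
#   $1 = file to check, $2 = old IP
old_ip_gone() {
  [ -r "$1" ] || { echo "cannot read $1" >&2; return 2; }
  # Escape the dots so that they only match literal dots.
  old_re="$(printf '%s' "$2" | sed 's/\./\\./g')"
  # Match the IP only when it is not part of a longer address such as 10.0.0.20.
  if grep -Eq "(^|[^0-9.])${old_re}([^0-9.]|\$)" "$1"; then
    echo "old IP $2 still present in $1" >&2
    return 1
  fi
  return 0
}

# Example with the values from this post:
#   old_ip_gone /opt/vertica/config/admintools.conf 10.0.0.2
```

The same call can be pointed at spread.conf in the catalog directory of the new node.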
Run admintools (/opt/vertica/bin/admintools) and select “Restart Vertica on Host”. The node will then start the recovery process: all missing data will be copied to it, and once that is done it will rejoin the cluster, which will be complete again.