Vertica: some uses of sequences

When you want to have a unique identifier for a table, Vertica gives you 3 options. Create a numeric column either as AUTO_INCREMENT or IDENTITY, or use a sequence. The doc is comprehensive about the differences, but it can be summed up quite easily:

  • IDENTITY and AUTO_INCREMENT are part of a table, a SEQUENCE is a standalone object
  • AUTO_INCREMENT is simple and stupid and cannot be tweaked much, IDENTITY has a few more options (start value, increment) a SEQUENCE has a lot of options.

I will here talk a bit about the sequences, as they are the one allowing the most freedom.

Forcing the value of a incrementing column

If you have a table with an ‘id’ column, defined as a IDENTITY or AUTO_INCREMENT, you cannot set a value to this field during data load.:

CREATE TABLE tst_auto (id AUTO_INCREMENT, value varchar(10));
INSERT INTO tst_auto (id, value) VALUES (42, 'cheese');
ERROR 2444: Cannot insert into or update IDENTITY/AUTO_INCREMENT column "id"

If, on the other hand, you use a SEQUENCE, this is possible:

CREATE TABLE test_seq (id INT, value VARCHAR(10));
CREATE SEQUENCE seq;
ALTER TABLE test_seq ALTER COLUMN id set default NEXTVAL('seq');

You can then see that it does what you expect:

INSERT INTO test_seq (value) VALUES ('default');
INSERT INTO test_seq (id, value) VALUES (42, 'forced');
select * from test;
 id | value
----+---------
 1  | default
 42 | forced

If you use this, you must of course be careful that there are no duplication. If the example, you could for instance set the next value of the sequence to 43:

ALTER SEQUENCE seq RESTART WITH 43;

Using a sequence as a global identifier

The fun thing with a sequence is that it can be used on more than one table, thus giving you a global identifier, for instance:

-- 2 tables...
CREATE TABLE tst  (id INT, value varchar(10));
CREATE TABLE tst2 (id INT, value varchar(10));

-- 1 sequence...
CREATE SEQUENCE tst_seq;

-- ... said sequence is used by both tables
ALTER TABLE tst  ALTER COLUMN id set default NEXTVAL('tst_seq');
ALTER TABLE tst2 ALTER COLUMN id set default NEXTVAL('tst_seq');

-- testing...
INSERT INTO tst  (value) VALUES ('tst');
INSERT INTO tst2 (value) VALUES ('tst2');

-- success!
SELECT * FROM tst;
--  id |  value
-- ----+---------
--   1 | tst
-- (1 row)

SELECT * FROM tst2;
--  id |  value
-- ----+----------
--   2 | tst2
-- (1 row)
Advertisement

Easy import from Hive to Vertica

Properly setup, Vertica can connect to Hcatalog, or read hdfs files. This does require some DBA work, though.

If you want to easily get data fro Hive to Vertica, you can use the COPY statement with the LOCAL STDIN modifier and pipe the output of Hive to the input of Vertica. Once you add a dd in the middle to prevent the stream to just stop after a while, this works perfectly. I am not so sure why dd is needed, but I suppose it buffers data and makes the magic happen.

hive -e "select whatever FROM wherever" | \
dd bs=1M | \
/opt/vertica/bin/vsql -U $V_USERNAME -w $V_PASSWORD -h $HOST $DB -c \
"COPY schema.table FROM LOCAL STDIN DELIMITER E'\t' NULL 'NULL' DIRECT"

Of course, the previous statement needs to be amended to use your own user, password and database.

The performance are quite good with this, although I cannot give a good benchmark as in our case the hive statement was not trivial.

One thing to really take care of is where you run this statement. You can run it from everywhere as long as hive and Vertica are accessible, but be aware that data will flow from hive to your server to Vertica. Running this statement on a Vertica node or your hive server will reduce the network traffic and might speed up things.

This post is based on my answer to a question on stackoverflow.

A GUI for Vertica: DbVisualizer

vsql is very porwerful, but it is always nice to have a nice GUI tool with your database. As Vertica can be accessed via ODBC, most tools can at least provide some kind of GUI on top of Vertica.

DbVisualizer (“dbvis”) goes one step further. They teamed up with HP to make dbvis aware of Vertica specifities as projections, sessions, load streams and more. You can find the list of Vertica features supported by dbvis on their website. Note that some of them are only available for the Pro version, unfortunately (note the little in the link above) but a lot of goodness is available in the free version.

A few screenshots below will show some nice Vertica-specific options.

Creating a table, with column encoding

CreateTableDBvis
Table with projections

TableWithProj

DBA views (notice sessions, locks, tuple mover…)

DBAviews

Vertica: Panic – Data consistency problems

While replacing a node and during the recovery, the node did reboot (human mistake). After actual reboot the recover did not proceed and the node stayed in DOWN state, even if we tried to restart it via the admintools.

In vertica.log, we could see the following lines:

<PANIC> @v_dwh_node0003: VX001/2973: Data consistency problems found; startup aborted
 HINT: Check that all file systems are properly mounted. Also, the --force option can be used to delete corrupted data and recover from the cluster
 LOCATION: mainEntryPoint, /scratch_a/release/vbuild/vertica/Basics/vertica.cpp:1441

As the logs nicely suggest, using the (undocumented) --force option can help. That said, this option cannot be used from the admintool curse interface, and must be used from the command line:

/opt/vertica/bin/admintools -t restart_node -d $db_name -s $host --force

That way corrupted data was deleted, and the recovering could carry on nicely.

Vertica upgrade or install: Default shell must be set to bash

When upgrading Vertica, it will check if bash is the default shell for the dbadmin user, and complain loudly if this is not the case:

Error: Default shell on the following nodes are not bash. Default shell must be set to bash.
10.0.0.1 for dbadmin:
10.0.0.2 for dbadmin:
Exiting...
Installation FAILED with errors.

Of course, if the shell is indeed not bash, you can fix it by running

chsh -s /bin/bash

on the relevant nodes as the dbadmin user.

In some cases, bash is indeed the shell, but due to ssh or sudo configuration, the installer does not see it. In that case, using visudo to edit the /etc/sudoers file just add the following lines:

Defaults:dbadmin !requiretty
dbadmin ALL=(ALL) NOPASSWD:ALL

Those lines are needed only at install, and can be reverted afterwards.

Replace a failed node in Vertica

Just a little semantic reminder before we dive in:

  • A host is a server on which Vertica is set up, but not necessarily used by a database. You can add hosts to a cluster to make them available to a database.
  • A node is a host part of a database.

If one of your node goes down (but the server is still up, I am thinking about data disk failure), according to the vertica documentation it is possible to replace it by another node with another IP address. This case never presented itself to me, so I will trust the documentation on that.

I had the issue of a dead host, though. In that case, the documentation is not enough. As part of the process of replacing a node, you need to add a new host to the cluster. While doing this via the update_vertica utility, Vertica will check connection between all hosts of the cluster. As one host is down, installation will fail.

In that case, the solution is not trivial but quite straightforward, and this is the goal of this post to explain it step by step.

  1. System preparation
  2. Update existing node info in the catalog via vsql
  3. Update admintools.conf
  4. Install Vertica on the new server
  5. Configure new node
  6. Restart new node

1 – System preparation

Make sure that Vertica is not installed and that /opt/vertica does not exist:

yum remove vertica
rm -rf /opt/vertica

2 – Update existing node info in the catalog via vsql

  • <failed_node_name>: the name of the host you want to replace, taken from the node_name column in v_catalog.nodes
  • <newip>: ip address of the replacement host
-- change the node IP
ALTER NODE &lt;failed_node_name&gt; HOSTNAME '&lt;newip&gt;';
-- change the nodes spread/control IP
ALTER NODE &lt;failed_node_name&gt; CONTROL HOSTNAME '&lt;newip&gt;';
-- re-write spread.conf with the new IP address and reload the running config
-- (db should remain UP)
SELECT RELOAD_SPREAD(true)

3 – Update admintools.conf

This must be done on a UP node. Any node will do, and in this post we will call it <source host>.

Edit the file /opt/vertica/conf/admintools.conf, by replacing all instances of the old ip address by the new ip address. This means that there will be 3 lines to update:

  • [Cluster] > hosts
  • [Nodes] : 2 lines, the one starting with the node name and the one with the node number.

For instance, assume we are replacing node2 from a 3-node cluster, from ip 10.0.0.2 to ip 10.0.0.42

Before, here were the relevant lines of admintools.conf:

[Cluster]
hosts = 10.0.0.1,10.0.0.2,10.0.0.3

[Nodes]
node0002 = 10.0.0.2,/home/dbadmin,/home/dbadmin
v_spil_dwh_node0002 = 10.0.0.2,/home/dbadmin,/home/dbadmin

After, notice the parts in bold:

[Cluster]
hosts = 10.0.0.1,10.0.0.42,10.0.0.3

[Nodes]
node0002 = 10.0.0.42,/home/dbadmin,/home/dbadmin
v_spil_dwh_node0002 = 10.0.0.42,/home/dbadmin,/home/dbadmin

4 – Install Vertica on the new server

On the same host used in the previous step, <source host> use update_vertica to add the new host to the cluster. The mandatory options are --rpm and the options you used at install time (which you can find in /opt/vertica/config/admintools.conf) or the path to your config file (--config-file/-z) if you used one.

Do NOT use the -S/--control-network, -A/--add-hosts or -R/--remove-hosts switches. You most likely will use -u/--dba-user, -g/--dba-group, -p/--dba-user-password and maybe a few more.

sudo /opt/vertica/sbin/update_vertica --rpm <complete path of RPM> -u <user> -g <group> ...

This script will verify all hosts, and will install the rpm on the new one.

5 – Configure new node

Login as dbadmin (or whichever user is your database administrator) on the new node, and recreate the base and data directory as they were in the failed node. Assuming that:

  • the failed node was node2,
  • your database is named $dwh,
  • your base directory is /home/dbadmin,

Then create the following:

mkdir /home/dbadmin/$dwh
mkdir /home/dbadmin/$dwh/v_$dwh_node0002_data
mkdir /home/dbadmin/$dwh/v_$dwh_node0002_catalog

You can look on any UP node to have an example of hierarchy.

Still from the node where you edited the config file, <source_host>, distribute them via the admintools:

  • run as dbadmin user /opt/vertica/bin/admintools
  • go to Configuration Menu > Distribute Config Files
  • select Database Configuration and Admintools Meta-Data.

If you cannot find spread.conf under /home/dbadmin/$dwh/v_$dwh_node0002_catalog copy it from <source_host>. You can check that spread.conf now have the IP 10.0.0.42 instead of 10.0.0.2.

Finally, as a last sanity check, have a look at /opt/vertica/config/admintools.conf and make sure that the new IP appears instead of the old one.

6 – Restart new node

Use the the admintools /opt/vertica/bin/admintools and select “Restart Vertica on Host”. The node wil then start the recovery process. All missing data will be copied to it, and once done it will join the cluster which will be complete again.

Vertica: move_partitions_to_table()

If you have a load in multiple steps (say data is first loaded in a staging schema, then moved to another one), using MOVE_PARTITIONS_TO_TABLE() is the best option. This is very fast, as you do not need to actually issue INSERT or DELETE statements or deal with the delete vectors. Vertica just moves in one go a block of data, and you are done.

When you issue this function, Vertica will first issue a moveout to empty the WOS before moving data. This does make sense, as if data is loaded in memory, you still want to make sure it is moved with the rest of the of the partition already on disk.

Since at least Vertica 7.0.1-1, using this function might give you the error message:

select move_partitions_to_table ('stg.A', 0, 24, 'vault.A')
ERROR: A Moveout operation is already in progress on projection vault.B_node0002

As you can see, a moveout on table A fails because of a moveout on table B, which does not seem to make sense.

It has actually been confirmed that it does not make sense, and it actually is a bug in Vertica 7.0.1-1, which should be fixed in theory in the next service pack.

In the mean time, you can work around the bug by issuing INSERTs/DELETEs, or have your code checks if a moveout is already running before issuing the move_partitions_to_table() statement.

EDIT: The hotfix 7.1.1-1 (11/2014) fixes the bug.

Vertica: rename multiple tables in one go

I have the use case where I need to regularly fully drop and recreate a table in Vertica. To try to keep the period without data to a minimum, I want to load data in an intermediate table, then rename the old table out of the way, and rename the intermediate to its final name. It turns out that Vertica allows this in one command, hopefully thus avoiding race conditions.

After loading, before renaming:

vertica=> select * from important;
 v
-----
 old
(1 row)

vertica=> select * from important_intermediate ;
 v
-----
 new
(1 row)

Multiple renaming:

ALTER TABLE important,important_intermediate RENAME TO important_old, important;

after:

vertica=> select * from important;
 v
-----
 new
(1 row)
vertica=> select * from important_old;
 v
-----
 old
(1 row)

You will probably want to DROP import_old now.

Vertica tips and resources

I just found a few interesting resources around vertica:

  • A blog named vertica tips. As the name says, it contains a collection of very useful vertica tips and information around vertica.
  • By the author of this blog, the github repository vertica-kit contains a very nice and easy to use set of SQL queries to understand the current state of your server.
  • Another set of nice tools is contained in this vertica toolbelt, to complement standard monitoring tools.

Vertica ODBC error messages and solutions

Those are error messages and solutions found after lots of trials and errors. I am mostly using python with Vertica, some some solutions might thus be python specific, but most should be generic enough.

[HY000] [unixODBC][Vertica][ODBC] (11560) Unable to locate SQLGetPrivateProfileString function. (11560) (SQLDriverConnect)

The ODBC connection does not find a properly defined DSN. Reasons include:

  • Path not existing in one of the odbc.ini or odbcinst.ini files (check mostly ODBCInstLib, Driver, Driver64).

[22001] [Vertica][ODBC] (10170) String data right truncation on data from data source: String data is too big for the driver’s data buffer. (10170) (SQLPutData)

This is a unicode issue. Reasons might be:

  • Old pyodbc which does not handle UTF-8  properly (try to use version 3+)
  • Vertica’s VARCHAR length is given in bytes, not character. So if you have UTF8 characters in a string, you might go above the limit without noticing. Eg. a VARCHAR(1) can hold ‘0’ but not ‘€’.
  • Pyodbc does not handle unicode properly. If you are using python, encode in UTF-8.

[IM002] [unixODBC][Driver Manager]Data source name not found, and no default driver specified (0) (SQLDriverConnect)

The DSN used does not exist. Reasons include:

  • Typo in the DSN in your code (you are asking for a DSN not defined in odbc.ini).
  • odbc.ini file syntax invalid (for instance closing square bracket forgotten).
  • DSN not defined in the used odbc.ini file.
  • Wrong odbc.ini file used, hence DSN not found. This can happen if a $HOME/.odbc.ini file, often created by default, exists.
  • The odbc.ini is not in the expected path (/etc/odbc.ini). Pointing the ODBCINI environment variable to the right path might work.
  • The odbc.ini file references a Driver in the the relevant DSN section which is not defined in /etc/odbcinst.ini.

[HY000] [unixODBC][DSI] The error message NoSQLGetPrivateProfileString could not be found in the en-US locale. Check that /en-US/ODBCMessages.xml exists. (-1) (SQLDriverConnect)

Vertica needs some extra specifications in either /etc/vertica.ini (default), or in the file pointed to by the VERTICAINI environment variable:

[Driver]
ErrorMessagesPath = /opt/vertica/lib64/
ODBCInstLib = /usr/lib/x86_64-linux-gnu/libodbcinst.so
DriverManagerEncoding=UTF-16

Usually I just add this to odbc.ini and points VERTICAINI to it.

pyodbc.Error: (‘H’, ‘[H] [unixODBC][ (4294967295) (SQLDriverConnectW)’)

You are using an old version of pyodbc. Upgrade system wide or create a virtualenv and pip install pyodbc.