The Vertica COPY statement

Why not just use INSERT?

Vertica being a columnar database, this means that data is grouped by column on disk. The consequence is that inserting or updating one row means updating all columns of the table, and so basically the whole table. To prevent this, Vertica has a batch uploader dealing with the update in a optimised way. Using it showed us an improvement of 15 rows inserted per second with INSERT to many thousands with COPY.

What is specific about COPY?

Basically, Vertica does not insert rows directly on disk. They are first inserted on the WOS, or Write Optimised Store, in memory. Loading data is thus very quick. After a while, when the WOS is full or after load, Vertica writes the WOS to the ROS, or Read Optimised Store.

Vertica itself manages overflowing from WOS to ROS and the merging of small inefficient ROS containers. It basically does all the housekeeping for you, even if you can influence it if you really want to by managing yourself the tuple mover.

How to use COPY?

This is quite easy, the vertica doc explains all the dirty details, but the main use will be something like this to load a compressed csv with 1 header line that we want to skip. In this case the CSV is on your client machine:

COPY schema.table 
    (column_a, column_b...)
FROM LOCAL '/tmp/bigdata.csv.gz' GZIP
WITH
    DELIMITER AS ','
    ENCLOSED BY '"'
    SKIP 1

To go further

This is really the crux on it. All the rest is technical details. They can be very powerful details, and I will explain a few here.

Source file location

The file can be located on your client (FROM LOCAL ‘path’) or already be on one of your nodes (FROM ‘path’ ON nodename). You can even load FROM STDIN.

Multiple files at once

Vertica supports globals. You can thus load for instance ‘/tmp/201312*.csv’.

Specification per column

The specifications (DELIMITER, ENCLOSED BY) are in my example defined globally. You can define them per column after each column name. If for instance the first column has ‘:’ as a delimiter, and the others ‘;’, you would write:

COPY schema.table 
    (column_a DELIMITER ':', column_b)
FROM LOCAL '/tmp/bigdata.csv.gz' GZIP
    DELIMITER ';'

Skip or transform columns

Use the FILLER parameter. The doc already has great examples, so I will just show them here. Transformation can be done this way:

CREATE TABLE t (k TIMESTAMP);
COPY t(year FILLER VARCHAR(10),
       month FILLER VARCHAR(10), 
       day FILLER VARCHAR(10), 
       k AS TO_DATE(YEAR || MONTH || DAY, 'YYYYMMDD'))
FROM STDIN NO COMMIT;
2009|06|17
1979|06|30
2007|11|26
\.
SELECT * FROM t;
         k 
--------------------- 
 2009-06-17 00:00:00 
 1979-06-30 00:00:00 
 2007-11-26 00:00:00 
 (3 rows)

Skipping columns is done by specifying the datatype to ignore. Not that in this example the first column is casted to timestamp.

create table t (k timestamp);
copy t(y FILLER date FORMAT 'YYYY-MM-DD', 
       t FILLER varchar(10),
       k as y) from STDIN no commit;
2009-06-17|2009-06-17
\.

Load method: big load vs. small load

You can add a parameter at the end of the statement to chose the load method: AUTO, TRICKLE or DIRECT.

  • AUTO is the default, will load first in the WOS (memory), then overflows to ROS (disk) when WOS is full.
  • TRICKLE for small loads, load only into the WOS (memory). Copying to ROS (disk) is done after loading. You will get an error if the WOS is full.
  • DIRECT for big batches, loads into the ROS (disk).

Error management

Last useful detail, you can setup some error management to do some postmortem on data which did not load properly.

COPY t FROM STDIN
    REJECTMAX 10
    EXCEPTIONS '/tmp/exceptions'
    REJECTED DATA '/tmp/raw_errors'

This will error the COPY statement out after 10 (REJECTMAX) rejected rows. The raw data of those rows will be written to REJECTED DATA path, with helpful information stored at EXCEPTION path.

Note that the REJECTED DATA file can become quite big. In one instance, I wrongly overrode the RECORD TERMINATOR parameter, which basically made Vertica think that my whole file was one big line. This line was thus of course badly formatted, and was helpfully written to REJECTED DATA. This file essentially became a full copy of my source file, blowing my partition up. So be careful here!

Monitoring

By looking at the v_monitor.load_streams table you can see information about historical and even current loads. The number of row loaded and sorted are the most interesting values to get.

select * from v_monitor.load_streams;
-[ RECORD 1 ]----------+---------------------------------
session_id             | vertica.local-27774:0xa799
transaction_id         | 45035996273707887
statement_id           | 1
stream_name            |
schema_name            | reports
table_id               | 45035996273722464
table_name             | sales
load_start             | 2014-08-20 09:02:55.631703+02
load_duration_ms       |
is_executing           | t
accepted_row_count     | 0
rejected_row_count     | 0
read_bytes             | 0
input_file_size_bytes  | 0
parse_complete_percent |
unsorted_row_count     | 435666962
sorted_row_count       | 60687177
sort_complete_percent  | 13
Advertisements

One thought on “The Vertica COPY statement

  1. Pingback: Vertica optimisation part 2: best practices | This DWH guy

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s