Why not just use INSERT?
Vertica is a columnar database, which means data is grouped by column on disk. As a consequence, inserting or updating a single row means touching every column file of the table, so essentially the whole table. To avoid this, Vertica provides a batch loader, COPY, which handles the load in an optimised way. Switching to it took us from about 15 rows inserted per second with INSERT to many thousands per second with COPY.
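To make the contrast concrete, here is a minimal sketch (the table and file names are made up for illustration): the first approach issues one statement per row, the second loads the same data in a single COPY.

-- Row-by-row: one statement per row, each touching every column file. Slow.
INSERT INTO reports.sales (sale_date, amount) VALUES ('2014-08-20', 42.50);
INSERT INTO reports.sales (sale_date, amount) VALUES ('2014-08-20', 13.37);
-- ... thousands more ...

-- Batch: one COPY statement loads the whole file in a single pass.
COPY reports.sales (sale_date, amount)
FROM LOCAL '/tmp/sales.csv'
DELIMITER ',';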
What is specific about COPY?
Basically, Vertica does not write rows directly to disk. They are first inserted into the WOS (Write Optimised Store), in memory, which makes loading very quick. After a while, when the WOS is full or once the load has finished, Vertica moves the data from the WOS to the ROS (Read Optimised Store) on disk.
Vertica itself manages the overflow from WOS to ROS and the merging of small, inefficient ROS containers. It basically does all the housekeeping for you, although you can influence it if you really want to by driving the Tuple Mover yourself.
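If you do want to nudge the Tuple Mover manually, Vertica exposes it through the DO_TM_TASK function. A minimal sketch (the table name is illustrative):

-- Force a moveout: flush WOS data for a table to ROS on disk.
SELECT DO_TM_TASK('moveout', 'reports.sales');
-- Force a mergeout: consolidate small ROS containers into bigger ones.
SELECT DO_TM_TASK('mergeout', 'reports.sales');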
How to use COPY?
This is quite easy. The Vertica doc explains all the dirty details, but the main use will look something like the following: loading a compressed CSV with one header line that we want to skip. In this case the CSV is on your client machine:
COPY schema.table
(column_a, column_b...)
FROM LOCAL '/tmp/bigdata.csv.gz' GZIP
WITH
DELIMITER AS ','
ENCLOSED BY '"'
SKIP 1
To go further
This is really the crux of it. All the rest is technical details. They can be very powerful details, though, and I will explain a few here.
Source file location
The file can be located on your client (FROM LOCAL 'path') or already be on one of your nodes (FROM 'path' ON nodename). You can even load FROM STDIN.
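As an illustration, the three variants would look like this (paths and the node name are placeholders):

-- File on the client machine, shipped to the cluster by the driver:
COPY schema.table FROM LOCAL '/tmp/bigdata.csv' DELIMITER ',';
-- File already sitting on a specific node:
COPY schema.table FROM '/data/bigdata.csv' ON v_mydb_node0001 DELIMITER ',';
-- Data piped in on standard input:
COPY schema.table FROM STDIN DELIMITER ',';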
Multiple files at once
Vertica supports globs. You can thus load, for instance, '/tmp/201312*.csv'.
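For example, loading a whole month of daily files in one statement could look like this (a sketch with illustrative paths, using a server-side path):

COPY schema.table
FROM '/tmp/201312*.csv'
DELIMITER ','
SKIP 1;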
Specification per column
The specifications (DELIMITER, ENCLOSED BY) are defined globally in my example. You can also define them per column, after each column name. If for instance the first column has ':' as a delimiter, and the others ';', you would write:
COPY schema.table (column_a DELIMITER ':', column_b) FROM LOCAL '/tmp/bigdata.csv.gz' GZIP DELIMITER ';'
Skip or transform columns
Use the FILLER parameter. The doc already has great examples, so I will just show them here. Transformation can be done this way:
CREATE TABLE t (k TIMESTAMP);
COPY t(year FILLER VARCHAR(10),
month FILLER VARCHAR(10),
day FILLER VARCHAR(10),
k AS TO_DATE(YEAR || MONTH || DAY, 'YYYYMMDD'))
FROM STDIN NO COMMIT;
2009|06|17
1979|06|30
2007|11|26
\.
SELECT * FROM t;
k
---------------------
2009-06-17 00:00:00
1979-06-30 00:00:00
2007-11-26 00:00:00
(3 rows)
Skipping columns is done by specifying the datatype to ignore. Note that in this example the first column is cast to a timestamp.
CREATE TABLE t (k TIMESTAMP);
COPY t(y FILLER DATE FORMAT 'YYYY-MM-DD',
t FILLER VARCHAR(10),
k AS y)
FROM STDIN NO COMMIT;
2009-06-17|2009-06-17
\.
Load method: big load vs. small load
You can add a parameter at the end of the statement to choose the load method: AUTO, TRICKLE or DIRECT (see the example after this list).
- AUTO is the default: loads first into the WOS (memory), then overflows to the ROS (disk) when the WOS is full.
- TRICKLE is for small loads: loads only into the WOS (memory), and data is moved to the ROS (disk) after loading. You will get an error if the WOS is full.
- DIRECT is for big batches: loads straight into the ROS (disk).
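For a big bulk load, the method simply goes at the end of the statement; a minimal sketch reusing the same illustrative file as above:

COPY schema.table
(column_a, column_b)
FROM LOCAL '/tmp/bigdata.csv.gz' GZIP
DELIMITER ','
SKIP 1
DIRECT;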
Error management
Last useful detail: you can set up some error management to do a post-mortem on data which did not load properly.
COPY t FROM STDIN
REJECTMAX 10
EXCEPTIONS '/tmp/exceptions'
REJECTED DATA '/tmp/raw_errors'
This will make the COPY statement fail once more than 10 (REJECTMAX) rows are rejected. The raw data of the rejected rows is written to the REJECTED DATA path, with helpful information about why each row was rejected stored at the EXCEPTIONS path.
Note that the REJECTED DATA file can become quite big. In one instance, I wrongly overrode the RECORD TERMINATOR parameter, which basically made Vertica think that my whole file was one big line. That line was of course badly formatted, and was helpfully written to REJECTED DATA. The file essentially became a full copy of my source file, filling up my partition. So be careful here!
Monitoring
By looking at the v_monitor.load_streams table you can see information about historical and even currently running loads. The numbers of rows loaded and sorted are the most interesting values to look at.
select * from v_monitor.load_streams;
-[ RECORD 1 ]----------+---------------------------------
session_id             | vertica.local-27774:0xa799
transaction_id         | 45035996273707887
statement_id           | 1
stream_name            |
schema_name            | reports
table_id               | 45035996273722464
table_name             | sales
load_start             | 2014-08-20 09:02:55.631703+02
load_duration_ms       |
is_executing           | t
accepted_row_count     | 0
rejected_row_count     | 0
read_bytes             | 0
input_file_size_bytes  | 0
parse_complete_percent |
unsorted_row_count     | 435666962
sorted_row_count       | 60687177
sort_complete_percent  | 13