This is the second post suggesting some Vertica performance improvements. Having looked at the system side of things, we will now look at how to use Vertica itself.
INSERT vs. COPY statements
Vertica is a columnar database. I will not get into details of what this means, but as data is stored per column, you can imagine that inserting one row impacts the whole table, as all the columns need to be updated.
Our first test with a standard ETL tool (Pentaho) used a plain INSERT into a big-ish table. We achieved the great feat of inserting fewer than 100 rows per second. Then we switched to the Vertica way, i.e. using a COPY statement, and reached many thousands of rows per second.
Lesson 1: Use the COPY command, not INSERT. I wrote another blog post specifically about the COPY statement.
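As a rough illustration of the difference, a bulk load might look like the sketch below. The table name `events`, its columns, the file paths and the CSV layout are all assumptions made up for this example, so check the COPY options against the documentation for your Vertica version.

```sql
-- Hypothetical table, columns and file paths, for illustration only.

-- Slow: one statement per row.
INSERT INTO events (event_time, event_type)
VALUES ('2015-01-01 00:00:01', 'click');

-- Fast: load the whole file in a single bulk operation.
COPY events (event_time, event_type)
FROM '/data/events.csv'                      -- file readable from a database node
DELIMITER ','
SKIP 1                                       -- skip the header line
REJECTED DATA '/tmp/events_rejected.txt'     -- rows that could not be parsed
EXCEPTIONS '/tmp/events_exceptions.txt';     -- why they were rejected
```

If the file lives on the client machine rather than on a database node, `COPY ... FROM LOCAL` is the variant to look at.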
Window and analytical functions
Vertica supports window functions. They are related to grouping in that they return aggregated values, but they do so without collapsing the rows. An example taken from the Vertica documentation shows this. In the analytical version every row is still present (you do not get only one row per department), yet each employee row carries a count (here a running count within the department), unlike in the aggregate version.
- Aggregate, 1 row per department:
SELECT dept_no, COUNT(*) AS emp_count
FROM employees
GROUP BY dept_no ORDER BY dept_no;
dept_no | emp_count
10 | 2
20 | 6
30 | 3
- Analytical, all the rows are present:
SELECT emp_no, dept_no
, COUNT(*) OVER(PARTITION BY dept_no ORDER BY emp_no) AS emp_count
FROM employees;
emp_no | dept_no | emp_count
1 | 10 | 1
4 | 10 | 2
6 | 20 | 1
7 | 20 | 2
8 | 20 | 3
9 | 20 | 4
10 | 20 | 5
11 | 20 | 6
2 | 30 | 1
3 | 30 | 2
5 | 30 | 3
There is a lot more to say, but this is not the place for it. You can do a lot with these analytical functions, which are a very strong point of Vertica, for instance:
- Rank the longest-standing customers in a particular state
- Calculate the moving average of retail volume over a specified time
- Find the highest score among all students in the same grade
- Compare the current sales bonus each salesperson received against his or her previous bonus
Why do I mention them here? Because Vertica optimises those queries very aggressively. It knows how to run the partitions in parallel, making for dramatic speed improvements when you can rewrite joins as window functions. With a bit of experience, you even end up with SQL that is more compact and more readable.
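To make this concrete, here is a minimal sketch of the "compare each bonus against the previous one" case from the list above. The table `sales_bonus` and its columns are hypothetical. The classic approach joins the table to itself to find the preceding row, whereas `LAG` gets the same value in a single pass over each partition.

```sql
-- Hypothetical table: sales_bonus(salesperson_id, bonus_date, bonus_amount)
SELECT salesperson_id,
       bonus_date,
       bonus_amount,
       -- Previous bonus of the same salesperson (NULL for the first one)
       LAG(bonus_amount) OVER (PARTITION BY salesperson_id
                               ORDER BY bonus_date) AS previous_bonus,
       -- Moving average over the current and the two preceding bonuses
       AVG(bonus_amount) OVER (PARTITION BY salesperson_id
                               ORDER BY bonus_date
                               ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS moving_avg
FROM sales_bonus
ORDER BY salesperson_id, bonus_date;
```

Both window functions share the same partitioning and ordering, so no self-join is needed to reach the neighbouring rows.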
Lesson 2: Use window functions
Projections
Projections are a big strength of Vertica. When you create a table, you define a logical schema (CREATE TABLE), but Vertica also creates a default physical schema (CREATE PROJECTION). Projections are how the data is physically stored on disk, and they are completely transparent to the end user. You as an administrator can still tailor your projections, though.
If, for instance, you know that a specific column of a table will always appear in the WHERE condition, you can put it first in the projection's sort order. As Vertica stores data sorted on disk, this column will be very quick to access, and Vertica will know where to find the relevant data without even needing to uncompress it. Had this column been last in the sort order, its values would already have been at least partly split up by the preceding columns, increasing access time.
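A sort order like the one just described could be declared roughly as follows. This is only a sketch: the table `events`, its columns and the projection name are invented for the example, and the right encoding and segmentation depend on your data and cluster.

```sql
-- Hypothetical table: events(customer_id, event_time, event_type)
-- customer_id is assumed to be the column that always appears in WHERE.
CREATE PROJECTION events_by_customer (
    customer_id ENCODING RLE,   -- sorted first, so long runs compress well
    event_time,
    event_type
) AS
SELECT customer_id, event_time, event_type
FROM events
ORDER BY customer_id, event_time
SEGMENTED BY HASH(customer_id) ALL NODES;

-- Populate the new projection from the data already in the table.
SELECT REFRESH('events');
```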
If you think “index” you might be right, but this is a taboo word in the Vertica world.
What does that mean for you?
By default, Vertica (via the Database Designer or at first insert) is pretty good at creating projections for you. In some cases, though, you might want to go further. You could, for instance, create 2 projections for a table. This would double load time, of course, as the data would have to be written to disk twice, but if the projections are properly tailored, they can make specific queries a lot faster (query-specific projections). If you are using a MERGE statement, having the 2 sides of the merge identically segmented (segmentation is part of defining a projection) will make the merge a lot faster, as it avoids moving rows across the network.
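The MERGE case might look like the sketch below. The tables `customers` and `customers_staging` are hypothetical. The point is only that both are segmented on the same expression, so rows with the same key should already sit on the same node.

```sql
-- Hypothetical target and staging tables, identically segmented.
CREATE TABLE customers (
    customer_id INT NOT NULL,
    name        VARCHAR(100),
    updated_at  TIMESTAMP
)
ORDER BY customer_id
SEGMENTED BY HASH(customer_id) ALL NODES;

CREATE TABLE customers_staging (
    customer_id INT NOT NULL,
    name        VARCHAR(100),
    updated_at  TIMESTAMP
)
ORDER BY customer_id
SEGMENTED BY HASH(customer_id) ALL NODES;

-- The join key is also the segmentation key on both sides.
MERGE INTO customers t
USING customers_staging s
   ON t.customer_id = s.customer_id
WHEN MATCHED THEN
    UPDATE SET name = s.name, updated_at = s.updated_at
WHEN NOT MATCHED THEN
    INSERT (customer_id, name, updated_at)
    VALUES (s.customer_id, s.name, s.updated_at);
```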
Furthermore, you can create pre-join projections, where you pay the join cost at load time, not at query time.
A word of warning here. Projections are extremely powerful, but you really can shoot yourself in the foot with them. You could, for instance, create the same projection twice on a table. This would double load time without having any impact at query time, as Vertica chooses only one projection per query.
There is a lot more to say about projections. You can read all about them in the online documentation.
Lesson 3: Set up good projections
Timestamp considered harmful
Just imagine. You have a big table containing events. You receive thousands of events per second, all carrying a timestamp with one-second precision. Vertica will of course happily load those events, compressing and storing them on disk.
You then want to look at your events from yesterday. You thus SELECT based on your timestamp column. This column has a cardinality of 86.4k for yesterday alone, which means roughly 2.6M for a month. Of course, this will work. But the data stored on disk for this column is huge due to the cardinality, and Vertica will have a lot of extra processing to do.
Ask yourself this question: do you really need data at the level of the second? What is the smallest time slice you use when querying this table? If your heaviest queries never look at anything more precise than one hour, create an extra column holding only an hourly timestamp. Just like that, you have divided the cardinality of your column, and thus of the first column looked at in your query, by 3.6k, saving on IO and processing time. Of course, this does not mean that you need to get rid of the timestamp column entirely. It will still be there for ad hoc queries when needed. But your usual big queries will run a lot faster.
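One possible way to set this up is sketched below, under the assumption of a hypothetical `events` table loaded from a two-column CSV file. The hourly column is computed at load time with `DATE_TRUNC`, and the heavy queries then filter and group on it instead of on the full timestamp.

```sql
-- Hypothetical table: the hourly column comes first in the sort order.
CREATE TABLE events (
    event_time TIMESTAMP NOT NULL,   -- full precision, kept for ad hoc queries
    event_hour TIMESTAMP NOT NULL,   -- same instant truncated to the hour
    event_type VARCHAR(64)
)
ORDER BY event_hour, event_time;

-- Derive event_hour while loading; the file only contains the other two columns.
COPY events (
    event_time,
    event_type,
    event_hour AS DATE_TRUNC('hour', event_time)
)
FROM '/data/events.csv'              -- hypothetical path
DELIMITER ',';

-- Heavy query: yesterday's events per hour, using the low-cardinality column.
SELECT event_hour, COUNT(*) AS event_count
FROM events
WHERE event_hour >= CURRENT_DATE - INTERVAL '1 day'
  AND event_hour <  CURRENT_DATE
GROUP BY event_hour
ORDER BY event_hour;
```

For one day this column holds 24 distinct values instead of 86,400, which is the 3.6k reduction mentioned above.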
Lesson 4: Do not use timestamp in heavy queries.