Out of the box, Vertica is an amazing system that does not need a lot of configuration to perform well. That said, a few easy tweaks can improve its performance even further. This first post explains what can be done at the system level; a second post will suggest a few best practices.
Readahead is the way Linux prefetches data. If you fetch a small amount of data from disk, the main cost incurred is head positioning and disk rotation. This means that the cost of fetching a larger amount of data from the same position on disk is dwarfed by that initial cost. The command below sets the readahead to 8192 sectors of 512 bytes, so 4MB of data is fetched on each read. Just issue it as root or with sudo for a runtime change. Copy the line into /etc/rc.local for a permanent change.
# where sdxx is the disk on which your data sits
/sbin/blockdev --setra 8192 /dev/sdxx
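To check what the value means and whether it was applied, something like the following can help. It is only a sketch: sdxx is still a placeholder, the helper name sectors_to_kib is mine, and I am assuming the kernel exposes the current value in KiB under /sys/block/<disk>/queue/read_ahead_kb.

```shell
# DISK is a placeholder: replace it with the disk on which your data sits.
DISK=sdxx

# blockdev counts readahead in 512-byte sectors; convert that to KiB.
sectors_to_kib() {
    echo $(( $1 * 512 / 1024 ))
}

echo "8192 sectors = $(sectors_to_kib 8192) KiB"

# Only read the current value if the disk really exists
# (it will not with the placeholder name).
RA_FILE="/sys/block/$DISK/queue/read_ahead_kb"
if [ -r "$RA_FILE" ]; then
    echo "current readahead on $DISK: $(cat "$RA_FILE") KiB"
fi
```

After running blockdev --setra 8192, the value reported for the disk should be 4096.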
The swappiness is what tells a Linux system how to balance data between memory and swap. By default it is set to 60. Vertica does require some swap, even if you have enough memory: not having enough swap can result in the out-of-memory killer (oom-killer) deciding to kill Vertica. Note that Vertica recommends having at least 2GB of swap in any case. Setting the swappiness to 0 tells the kernel to swap as little as possible. Run the following as root or with sudo at runtime, or write it in /etc/rc.local for permanent effect.
echo 0 > /proc/sys/vm/swappiness
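Before changing anything, it is worth checking the current swappiness and how much swap the machine actually has. The snippet below is a rough sketch: the function names are mine, and it assumes swap is reported as SwapTotal (in kB) in /proc/meminfo, as on the Linux systems I have seen.

```shell
# Print the SwapTotal value (in kB) from a meminfo-style file.
swap_total_kb() {
    awk '$1 == "SwapTotal:" { print $2 }' "${1:-/proc/meminfo}"
}

# Succeed only if at least 2GB (2 * 1024 * 1024 kB) of swap is configured.
has_enough_swap() {
    total=$(swap_total_kb "$1")
    [ -n "$total" ] && [ "$total" -ge 2097152 ]
}

# Report the current state; nothing is modified here.
if [ -r /proc/meminfo ] && [ -r /proc/sys/vm/swappiness ]; then
    echo "swappiness: $(cat /proc/sys/vm/swappiness)"
    if has_enough_swap /proc/meminfo; then
        echo "swap: at least 2GB configured"
    else
        echo "swap: less than 2GB configured"
    fi
fi
```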
Powertop is a little tool that looks at what consumes the most energy on a system. Although mostly useful for laptops, it can tell you if some application or daemon you would not think of is consuming energy, and thus probably resources.
Set up the IO scheduler
The default Linux scheduler is cfq (Completely Fair Queuing), which tries to balance the needs of all applications. Although ideal for desktop usage, this is not what we want for Vertica. We do not want to be fair, we just want Vertica to perform as fast as it can. For this purpose, the deadline scheduler, whose goal is to reduce latency, is a much better choice. Run the following as root or with sudo, or again write it in /etc/rc.local to keep the change permanent.
echo 'deadline' > /sys/block/sdxx/queue/scheduler
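To see which scheduler is active before and after the change, you can read the same file: the kernel lists the available schedulers and puts the active one in brackets, for instance "noop deadline [cfq]". The helper below is a small sketch of mine (active_scheduler is not a standard tool), and sdxx is still a placeholder. On newer kernels the names listed may differ (mq-deadline for instance), so check what the file offers.

```shell
# DISK is a placeholder: replace it with the disk on which your data sits.
DISK=sdxx

# Extract the entry between brackets, which is the active scheduler.
active_scheduler() {
    sed -n 's/.*\[\([^]]*\)\].*/\1/p' "$1"
}

# Only read the file if the disk really exists
# (it will not with the placeholder name).
SCHED_FILE="/sys/block/$DISK/queue/scheduler"
if [ -r "$SCHED_FILE" ]; then
    echo "active scheduler on $DISK: $(active_scheduler "$SCHED_FILE")"
fi
```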
Externalise ext4 journal
Ext4, the only filesystem supported by Vertica (along with its parent, ext3), is a journaling filesystem. This means that it records the changes to be made in a journal before committing them to disk, which speeds up recovery after a crash. This journal is kept on disk, so it adds extra writes. If you have another disk available, you can write the journal there, thus reducing the number of IOs on the data disk.
I will assume that the mount point on which Vertica writes its data is /mnt/vertica, and that the data partition is /dev/sda1. The journal partition will be /dev/sdb1.
# stop all relevant services
# you can see with lsof which users are using which files
# unmount the current ext4 fs
# remove journal from current FS
tune2fs -O ^has_journal
# set up external journal
tune2fs -o journal_data_ordered -j -J device=
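Put together with the names assumed earlier (/mnt/vertica, /dev/sda1, /dev/sdb1), the whole sequence could look like the sketch below. This is my reconstruction, not a tested recipe: the mke2fs step is taken from the tune2fs man page, which asks for the journal device to be created beforehand with the same block size as the data filesystem, and the final mount assumes an /etc/fstab entry for the mount point. Since these commands are destructive (the journal partition is erased), the script only prints them unless DRY_RUN=0 is set.

```shell
# Names assumed from the text; adapt them to your system.
MOUNT_POINT=/mnt/vertica
DATA_DEV=/dev/sda1
JOURNAL_DEV=/dev/sdb1

# Print each command; only execute it when DRY_RUN=0.
DRY_RUN=${DRY_RUN:-1}
run() {
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

externalise_journal() {
    # Vertica and anything else using the mount point must be stopped first.
    run umount "$MOUNT_POINT" || return 1
    # Remove the internal journal from the data filesystem.
    run tune2fs -O '^has_journal' "$DATA_DEV" || return 1
    # Turn the second partition into a journal device (this erases it).
    run mke2fs -O journal_dev "$JOURNAL_DEV" || return 1
    # Attach the external journal to the data filesystem.
    run tune2fs -o journal_data_ordered -j -J device="$JOURNAL_DEV" "$DATA_DEV" || return 1
    # Relies on an /etc/fstab entry for the mount point.
    run mount "$MOUNT_POINT"
}

externalise_journal
```

Read the printed commands first, then run the script again as root with DRY_RUN=0 once they match your setup.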
Data partition mount option
Unless mounted otherwise, ext4 keeps track of the access time of each file in its metadata. This means that even a read can result in a disk write. This can be avoided to increase performance by adding an option in your /etc/fstab in the line mounting the Vertica partition. You have 2 options:
- noatime: the access time is never updated, for the highest performance gain
- relatime: the access time is only updated when it is older than the last modification of the file (or more than a day old on recent kernels), which removes most of these writes
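As an illustration, here is what the fstab line could look like, with a small check that the option is really in effect. Everything specific is an assumption: the device and mount point are the ones used in this post, the other fstab fields are just common defaults, and has_mount_option is a helper of mine that reads a file in /proc/mounts format.

```shell
# Example /etc/fstab line (fields other than the option are assumptions):
#   /dev/sda1  /mnt/vertica  ext4  defaults,noatime  0  2
# To apply it without rebooting, as root:
#   mount -o remount,noatime /mnt/vertica

# Succeed if the given mount point is listed with the given option.
# Arguments: mounts file, mount point, option.
has_mount_option() {
    awk -v mp="$2" -v opt="$3" '
        $2 == mp {
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++) if (opts[i] == opt) found = 1
        }
        END { exit !found }
    ' "$1"
}

if [ -r /proc/mounts ] && has_mount_option /proc/mounts /mnt/vertica noatime; then
    echo "/mnt/vertica is mounted with noatime"
fi
```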
Create a catalog partition
The catalog can be put on its own partition to relieve the data partition of extra writes. The easiest way is then to move the catalog directory to the catalog partition and to create a symlink from its initial location to the new one.
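A minimal sketch of the move-and-symlink step, to be run with the database stopped. The function name and the paths in the usage comment are hypothetical; use the actual catalog directory of your database and the mount point of your catalog partition.

```shell
# Move a directory to a new location and leave a symlink behind.
# Arguments: current directory, new location (must not exist yet).
relocate_with_symlink() {
    src=$1
    dst=$2
    if [ ! -d "$src" ] || [ -L "$src" ]; then
        echo "source $src is not a plain directory" >&2
        return 1
    fi
    if [ -e "$dst" ]; then
        echo "destination $dst already exists" >&2
        return 1
    fi
    mv "$src" "$dst" && ln -s "$dst" "$src"
}

# Usage (hypothetical paths):
#   relocate_with_symlink /mnt/vertica/catalog /mnt/vertica_catalog/catalog
```

The function refuses to run twice on the same directory, since the source would then already be a symlink.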
I set all these enhancements up at a time when performance was very bad due to bad configuration. They had a huge impact (even the simplest queries ran about two orders of magnitude faster), but I sadly do not have numbers to pinpoint exactly which improvements were the most effective. I can just tell you that all together they are fantastic.