Python + Vertica = pyvertica

At $work we use mostly python for all the glue between systems and other developments. Naturally, when we started using Vertica, we wanted to connect to it via python. Of course, vertica is accessible via ODBC and supports standard SQL, but to fully harness its specificities, we needed a bit more. INSERT statements are very slow, for instance, and should be replaced by COPY, as already described in the best practices.

We quickly decided to create an open source library which can handle that for us. This hides the dirty details of the COPY statement, and allows us to insert data in a pythonic way.

The best way to show it in action is to display a quick example, taken from the documentation:


from pyvertica.batch import VerticaBatch

batch = VerticaBatch(
  odbc_kwargs={'dsn': 'VerticaDWH'},
  table_name='schema.my_table',
  truncate=True,
  column_list=['column_1', 'column_2'],
  copy_options={
    'DELIMITER': ',',
  }
)

row_list = [
  ['row_1_val_1', 'row_1_val_2'],
  ['row_2_val_1', 'row_2_val_2'],
  ...
]

for column_data_list in row_list:
  batch.insert_list(column_data_list)

error_bool, error_file_obj = batch.get_errors()

if error_bool:
  print error_file_obj.read()

batch.commit()

As you can see, inserting data is just as easy as calling insert_list on a well defined object. There are as well other helpers to insert raw data, if for instance you just read a line from a csv file, there is no need to parse it in python to define columns, you can just throw all data to pyvertica. All COPY options are useable when you define the object, they will just be passed as is to Vertica.

Technically, this was an interesting little project. The COPY statement reads data from a file (actual file, STDIN or fifo). As we did not want to duplicate data on disk, pyvertica actually creates a fifo, spawns a thread running the COPY statement and send data to the fifo. Whatever you send to pyvertica is converted into a csv-like structure, which is what COPY understands.

We use it in production since months without a single issue. Depending on used hardware and extra processing, it is easily possible to import 100.000 records/second.

The installation is very easy as pyvertica can be found in the python package index. Just type

pip install pyvertica

and you are good to go.

To actually connect to vertica, you have the choice of

pass a valid odbc connection to pyvertica
setup a proper odbc.ini and pass the DSN to pyvertica
craft your ow DSN-less line and pass it to pyvertica.

All documentation can be found on readthedocs page of pyvertica, and the source can be found and forked on the pyvertica github repository.

This Data Guy

Journey in a world of big(ger) data

Python + Vertica = pyvertica

Leave a comment Cancel reply

Share this:

Related

Leave a comment Cancel reply