Storage and Dataloading in Vertica

Column based stores are optimized for reading, but Vertica boasts quick write times and no locking issues. How does it manage this? Vertica uses its memory as a write optimized store before commiting data to disk in the main read optimized store.

Vertica's ROS and WOS architecture

Data loading

There are two commands for inserting data into the database. The first any SQL user should be familiar with; INSERT. This works pretty much as you would expect. The second is COPY. COPY takes delimited plain text data from a file or STDIN and bulk loads it into the database.

This is where the nuance of the WOS and ROS come in. COPY can run in three modes. The default mode (AUTO) fills up the WOS before writing to disk, and is intended for small to medium sized updates. COPY DIRECT inserts straight into the ROS and should be used for large sets of data. A third mode, COPY TRICKLE is available for small amounts of data. It is more efficient than the default, but will fail if your load uses up all the RAM.

Tuple Mover

We may have an optimised write store, but the data does need to get to disk eventually. This process is handled by the Tuple Mover. It batch compresses and writes items in the WOS to disk. This batching reduces the overheads of disk access and spreads it across the entire batch.

Data on disk is stored as a collection of read optimized stores. Each transaction will write to a new store as a way of preventing the overheads of merging and fragmentation. This also has the advantage of meaning that all accessible data is read only, avoiding the need for locks. Over time, however, these do need to be merged for performance. The Tuple Mover handles this process too.

Vertica's Tuple Mover

COPY Gotchas

COPY is highly optimized for speed. This means that corners are cut more than with INSERT. Data is not transformed by default, you will have to explicitly set your conversion (e.g. for dates) when writing you COPY statement.

More worryingly, referential integrity is not checked. It is check at query time rather than load time. This means that once you’ve spent the time to load your data you may have the server turn around and complain then you try to access it. As a result special care is needed when using¬†COPY.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s