Vertica Error 4115 and new tables

I’m in the process of creating data tables to be populated by our load system. These will be updated periodically, so using a MERGE statement makes sense. After creating the table and attempting to merge, I receive the following error:

ERROR 4115:  No super projection found for table table_name

But isn’t a super projection created by default when a table is created? Checking the documentation confirms that, yes, a super projection should be created. So why am I getting this error?

This comes down to one of Vertica’s counter-intuitive optimizations. For this table I didn’t define the super projection myself, so I am relying on Vertica to create one for me (I may revisit the projections once I’ve built my proof of concept). Vertica can derive the necessary information, but it only creates the projection when I INSERT into the table.

In fact, doing that resolves the issue: the call to INSERT creates the projection and the error goes away. It would seem that Vertica does not think to create the projection at MERGE time, despite the fact that the INSERTs and SELECTs work independently of one another.
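
To make that concrete, here is a minimal sketch of the sequence (the table, columns and values are hypothetical, and the MERGE is deliberately trivial):

-- A freshly created table with no explicit projection:
CREATE TABLE staging_events (id INT, payload VARCHAR(100));

-- A MERGE at this point fails with ERROR 4115.
-- A single INSERT forces Vertica to create the super projection:
INSERT INTO staging_events VALUES (1, 'seed row');

-- Subsequent MERGE statements now work as expected:
MERGE INTO staging_events t
USING (SELECT 2 AS id, 'new row' AS payload) s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET payload = s.payload
WHEN NOT MATCHED THEN INSERT (id, payload) VALUES (s.id, s.payload);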

EDIT: Yes, this does have the desired effect, but the preferred workaround is to add an encoding type to any column. Even explicitly setting it to AUTO does the trick. This is slightly less convoluted, but I don’t see why this step should be needed at all.
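
For example, declaring the table like this (names hypothetical, with ENCODING AUTO on a single column) avoids the error without the seed INSERT:

CREATE TABLE staging_events (
    id INT ENCODING AUTO,  -- an explicit encoding, even AUTO, is enough
    payload VARCHAR(100)
);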

Storage and Data Loading in Vertica

Column-based stores are optimized for reading, but Vertica boasts quick write times and no locking issues. How does it manage this? Vertica uses memory as a write-optimized store before committing data to disk in the main read-optimized store.

Vertica's ROS and WOS architecture

Data loading

There are two commands for inserting data into the database. The first, INSERT, any SQL user should be familiar with; it works pretty much as you would expect. The second is COPY, which takes delimited plain-text data from a file or STDIN and bulk loads it into the database.

This is where the nuance of the WOS and ROS comes in. COPY can run in three modes (sketched below). The default mode (AUTO) fills up the WOS before writing to disk, and is intended for small to medium-sized loads. COPY DIRECT inserts straight into the ROS and should be used for large sets of data. A third mode, COPY TRICKLE, is available for small amounts of data. It is more efficient than the default, but will fail if your load uses up all the RAM.
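
As a rough sketch, the three modes look like this (table and file paths are hypothetical):

-- AUTO (the default): fills the WOS first, spilling to the ROS if memory runs out
COPY sales FROM '/data/sales.csv' DELIMITER ',';

-- DIRECT: bypasses the WOS and writes straight to the ROS; best for large bulk loads
COPY sales FROM '/data/sales_big.csv' DELIMITER ',' DIRECT;

-- TRICKLE: WOS only; efficient for small loads, but errors if the WOS fills up
COPY sales FROM '/data/sales_small.csv' DELIMITER ',' TRICKLE;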

Tuple Mover

We may have an optimized write store, but the data does need to get to disk eventually. This process is handled by the Tuple Mover. It batch compresses and writes items in the WOS to disk. This batching reduces the overhead of disk access and spreads the cost across the entire batch.

Data on disk is stored as a collection of read-optimized stores. Each transaction writes to a new store, which avoids the overhead of merging and fragmentation. This also means that all accessible data on disk is read-only, avoiding the need for locks. Over time, however, these stores do need to be merged for performance. The Tuple Mover handles this process too.

Vertica's Tuple Mover
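
As far as I can tell, you can also trigger these Tuple Mover operations by hand through the DO_TM_TASK function, which is handy for testing (check the documentation for your version):

-- Force a moveout (WOS to ROS), then a mergeout (consolidate ROS containers):
SELECT DO_TM_TASK('moveout');
SELECT DO_TM_TASK('mergeout');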

COPY Gotchas

COPY is highly optimized for speed. This means that corners are cut more than with INSERT. Data is not transformed by default; you will have to set your conversions explicitly (e.g. for dates) when writing your COPY statement.
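
For example, a date held as text in the source file can be parsed with a per-column FORMAT option (table, columns and file are hypothetical):

-- Tell COPY how to interpret the incoming date strings:
COPY sales (sale_date FORMAT 'YYYY-MM-DD', amount) FROM '/data/sales.csv' DELIMITER ',';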

More worryingly, referential integrity is not checked at load time; it is checked at query time instead. This means that once you’ve spent the time loading your data, the server may turn around and complain when you try to access it. As a result, special care is needed when using COPY.
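
One mitigation, as I understand it, is to validate constraints yourself straight after the load with ANALYZE_CONSTRAINTS (table name hypothetical):

-- Returns any rows that violate primary key, unique or referential constraints:
SELECT ANALYZE_CONSTRAINTS('sales');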

Introduction to Vertica

Looks like I’ll be spending some time using Vertica, a relatively new SQL technology. My main contact with databases has been with MySQL, and while there is information on Vertica out there, it is far less than I’m used to. So I’m going to document my process. I’ve been looking for a writing challenge, and it will be good to have a reminder for myself and hopefully help the community at large.

What is Vertica?

Vertica is a column-oriented database with an SQL interface. NoSQL may be in vogue, but the team at Vertica have gone a different route. They argue that SQL isn’t the issue; it’s the underlying design choices that determine the bottlenecks.

Vertica started its life as C-Store, a cross-university academic project. It then spun out into its own company and was acquired by HP in 2011.

What is Vertica good at?

It all comes down to data throughput. Being column-oriented means that Vertica is optimized for reading cases where only a few columns are used. For example:

SELECT x.col1, sum(y.col2), sum(y.col3)
FROM x, y, z
WHERE x.a = y.a
AND y.b = z.b
AND z.a = 'foo'
GROUP BY x.col1;

Indexes aside, which would perform better: a database that has to read all the columns for the rows in question (in this case, all of them), or a system that only has to read the columns referenced in the query? All else being equal, the solution that reads the least data should be fastest. This is where a column store like Vertica comes in. It can read the columns required for the query and nothing else.

I did preface this with a big caveat: indexes. Indexes can make a huge difference to performance, but what is an index? It is one or a few columns from a table, written out separately in a denser format. Hmm, that looks a bit like a column store to me.

The one advantage a column store has over an index is that this design is built in. You do not need to add another index for each query you write. The disk space is already accounted for.

The performance differences, however, look quite different when you take a query that looks at all columns:

SELECT * FROM x;

Here the traditional row-based database is in its element. This is what its data layout is designed for: grab a whole load of data from disk and present it to the user. The column store, however, has to read each column separately and then reconstruct the tuples before returning them.

There are similar examples covered in more depth here.

What else is there to consider?

I am just a novice here, but there are lots of interesting features that I’ll be looking into over time. I glossed over the details of disk usage today. Vertica stores its data more like a collection of indexes than anything else. I’ve seen mention of multiple sort orders and a highly integrated use of compression. What are the implications of this redundancy, and of trading disk usage for CPU time?

Vertica is a shared-nothing clustered system. Handling dead nodes and maintaining quorum are things that really need to be right.

While the primary store is read-optimized, Vertica uses an additional write-optimized store in memory to shorten write times. This has great potential for mixed read and write workloads.

It’s a complex system and I’m looking forward to figuring out how it works.