Looks like I’ll be spending some time using Vertica, a relatively new SQL technology. My main contact with databases has been with MySQL. There is information but far less than I’m use to. So I’m going to document my process. I’ve been looking for a writing challenge. It’ll be good to have a reminder and hopefully help the community at large.
What is Vertica?
Vertica is a column oriented database with an SQL interface. NoSQL may be in vogue but the team at Vertica have gone a different route. They argue that SQL isn’t the issue but it’s the underlying design choices that determines the bottlenecks.
Vertica started its life as C-Store, which was a cross university academic project. It then spun out into its own company and got acquired by HP in 2011.
What is Vertica good at?
It all comes down data throughput. Being column-oriented means that Vertica is optimized for specific reading cases where few columns are used. For example:
SELECT x.col1, sum(y.col2), sum(y.col3)
FROM x, y, z
WHERE x.a = y.a
AND y.b = z.b
AND z.a = 'foo'
GROUP BY x.col1;
Indexes aside, which would perform better: A database that has to read all the columns for the rows in question (in this case, all of them) or a system that only has to read the column referenced in the query? All else equal, the solution that reads the least data should be fastest. This is where a column store like Vertica comes in. It can read the columns required for the query and nothing else.
I did preface this with a big caveat. Indexes can make a huge difference in performance, but what is an index? It is one or a few columns from a table and written separately in a more dense format. Hmm, that’s looks a bit like a column store to me.
The one advantage a column store has over an index is that this design is built in. You do not need to add another index for each query you write. The disk space is already accounted for.
The performance differences, however, look quite different when you take a query that looks at all columns:
SELECT * FROM x;
Here the traditional row based database is in it's element. This is what it's data is laid out for. Grab a whole load of data from disk and present it to the user. The column store, however, has to read each column separately and then reconstruct the tuples before returning them.
There are similar examples covered in more depth here.
What else is there to consider?
I am just a novice here, but there are lots of interesting features that I’ll be looking into over time. I glossed over the details of disk usage today. Vertica stores its data more like a collection of indexes than any thing else. I’ve seen mention of multiple sort orders and a highly integrated use of compression. What implications are there for this redundancy and trading disk usage for CPU time?
Vertica is a shared nothing clustered based system. Dead nodes and quorum are systems that really need to be right.
While the primary store is read optimized Vertica uses an additional write store and memory store to shorten write times. This has great potential for mixed reads and writes.
It’s a complex system and I’m looking for ward to figuring out how it works.