WTF is a SuperColumn? Introduction to the Cassandra Data Model

This translation of an article dated September 1, 2009 should be taken into account when reading. - approx. per.

In the past month or two, the Digg engineering team has spent quite a bit of time researching, testing and finalizing Cassandra production. It was a very fun project, but before the fun started, we had to spend some time figuring out what the Cassandra data model is ... the phrase “WTF is a“ super column ”” (“what the hell is that super column ? ”) Was pronounced more than once.

If you’ve previously worked with RDBMSs (this applies to almost everyone), you’ll probably be a bit discouraged by some of the names when studying the Cassandra data model. It took several days of discussions for me and my Digg team before we got the hang of it. A couple of weeks ago, on the developers mailing list, there was a bikeshed process on a completely new naming scheme to resolve the confusion. Throughout the discussion, I thought: “maybe if there are several normal examples, people will not be so confused by the names”. So, this is my attempt to explain the Cassandra data model; It is designed to ensure that you are familiar with, but do not go into the wilds, and, I hope, this will help clarify some things.
')

Pieces

Let's first take a look at the building blocks, before we see how they can all work together.

Column

A column is a minimum data element. This is a triplet containing the name, value, and timestamp. The column represented in the JSON notation:

{ name: "emailAddress" , // value: "arin@example.com" , // timestamp: 123456789 // }

It's all. For simplicity, you can omit the timestamp. And take the column as a name / value pair. Also, it is worth noting that both the name and the value are binary (technically byte []) and can be of any length.

SuperColumn

A supercolumn (SuperColumn) is a collection of a binary name and value, which is essentially a table containing an unlimited number of columns, with a key — the name of a column. Again, imagine this as JSON:

{ name: "homeAddress" , // : value: { // - street: {name: "street" , value: "1234 x street" , timestamp: 123456789}, city: {name: "city" , value: "san francisco" , timestamp: 123456789}, zip: {name: "zip" , value: "94107" , timestamp: 123456789}, } }

Column vs Supercolumn

Both columns and supercolumns are pairs of name and value. The key difference is that the value of a regular column is “row”, and the value of a supercolumn is a table of columns. This is the main difference. Their values contain different data types. Another minor difference is that the supercolumn does not contain a timestamp.

Before we begin to combine

Before going further, I want to simplify our notation with two things: 1) say goodbye to time stamps in columns and 2) pull the names of columns and supercolumns out, so that it will look like a key / value pair. So we move from:

{ name: "homeAddress" , value: { street: {name: "street" , value: "1234 x street" , timestamp: 123456789}, city: {name: "city" , value: "san francisco" , timestamp: 123456789}, zip: {name: "zip" , value: "94107" , timestamp: 123456789}, } }

homeAddress: { street: "1234 x street" , city: "san francisco" , zip: "94107" , }

Group them

There is a structure used for grouping both columns and supercolumns ... this structure is called the family of columns (ColumnFamily) and exists in two variations respectively - regular and super.

Column family

A column family is a structure containing an unlimited number of rows. Wow, did you say the strings? Yes - the lines :) To make it easier to keep in mind, just think of them as the rows of a table in a RDBMS.

So, each row has a key set by the client (you) and contains a set of columns. Again, the keys in the set are the column names and the values are the columns themselves:

UserProfile = { phatduckk: { // // username: "phatduckk" , email: "phatduckk@example.com" , phone: "(900) 976-6666" }, // ieure: { // // username: "ieure" , email: "ieure@example.com" , phone: "(888) 555-1212" age: "66" , gender: "undecided" }, }

Remember: for simplicity, we only show the value of a column, but in fact the values in the set are a whole column.

You can think of it as a hash table / dictionary or associative array. If you started thinking like that, then you are on the right track.

I want to draw your attention to the fact that at this level there is no mandatory scheme. Rows do not have a predefined list of columns that they contain. In our example above, you can see that the row with the key “ieure” contains columns with the names “age” and “gender”, while the string identified by the key “phatduckk” does not contain. This is 100% flexibility: one row can contain 1989 columns, while the other will be only 2. One row can contain a column with the name “foo”, while everyone else will not. Here it is - the prospect of a lack of a scheme in Cassandra.

The column family can also be super.

So, the column family can be of type Standard or Super.

What we looked at above is an example of the Standard type. It is standard because all its rows contain a table of ordinary (not super) columns ... there are no supercolumns.

If the column family is of type Super, then the opposite: each row contains a set of supercolumns. A set where the keys are the names of the supercolumns, and the values are the supercolumns themselves. And just for clarity: the family of supercolumns does not contain the usual columns. Here is an example:

AddressBook = { phatduckk: { // // - // // - // - friend1: {street: "8th street" , zip: "90210" , city: "Beverley Hills" , state: "CA" }, // John' phatduckk' John: {street: "Howard street" , zip: "94404" , city: "FC" , state: "CA" }, Kim: {street: "X street" , zip: "87876" , city: "Balls" , state: "VA" }, Tod: {street: "Jerry street" , zip: "54556" , city: "Cartoon" , state: "CO" }, Bob: {street: "Q Blvd" , zip: "24252" , city: "Nowhere" , state: "MN" }, ... // }, // ieure: { // joey: {street: "A ave" , zip: "55485" , city: "Hell" , state: "NV" }, William: {street: "Armpit Dr" , zip: "93301" , city: "Bakersfield" , state: "CA" }, }, }

Key space

Key space (Keyspace) is what unites all your data. All your column families are in key space. Your key space will probably match your application.

So, the key space can contain several column families, but this does not mean that they will somehow depend on each other. For example, they cannot be JOIN'ed, like tables in MySQL. Also, just because ColumnFamily_1 contains a string with the key "phatduckk", it does not mean that ColumnFamily_2 also contains it.

Sorting

So, we figured out what data containers exist, but another key element of the data model is how data is sorted. In Cassandra, you cannot make such queries as in SQL - you cannot specify how you want to sort the data when you make a selection (among other differences). The data is sorted as soon as you record it in a cluster and always remain sorted. This is a huge improvement in reading performance, but in exchange for this advantage, you need to make sure that you plan your data model in such a way that it is possible to satisfy your access schemes.

Columns inside rows are always sorted by column name. This is important, so I repeat: the columns are always sorted by name! How exactly the names are compared depends on the CompareWith parameter of the column family. By default, you have the following options: BytesType, UTF8Type, LexicalUUIDType, TimeUUIDType, AsciiType, and LongType. Each of these options treats column names as different data types, providing some flexibility. For example: using LongType will treat column names as 64-bit integers. Let's try and clarify this by looking at the data before and after sorting:

// , // Cassandra "" . // , - {name: 123, value: "hello there" }, {name: 832416, value: "kjjkbcjkcbbd" }, {name: 3, value: "101010101010" }, {name: 976, value: "kjjkbcjkcbbd" }

So, given that we use the LongType variant, these columns will look like this after sorting:

 < ColumnFamily CompareWith ="LongType" Name ="CF_NAME_HERE" />

// , // , {name: 3, value: "101010101010" }, {name: 123, value: "hello there" }, {name: 976, value: "kjjkbcjkcbbd" }, {name: 832416, value: "kjjkbcjkcbbd" }

As you can see, the column names were compared as if they were 64-bit integers. If we now use another version of CompareWith, we would get a different result. If we set CompareWith as UTF8Type, the column names would be treated as UTF8 strings and formed the following order:

 < ColumnFamily CompareWith ="UTF8Type" Name ="CF_NAME_HERE" />

// UTF8 {name: 123, value: "hello there" }, {name: 3, value: "101010101010" }, {name: 832416, value: "kjjkbcjkcbbd" }, {name: 976, value: "kjjkbcjkcbbd" }

A completely different result!

This sorting principle applies to supercolumns, but we have another dimension to work with: we define not only how super-columns should be sorted, but also how columns inside the supercolumns should be sorted. Sorting columns within super columns is determined by the value of the CompareSubcolumnsWith parameter. Here is an example:

// // { // name: "workAddress" , // value: { street: {name: "street" , value: "1234 x street" }, city: {name: "city" , value: "san francisco" }, zip: {name: "zip" , value: "94107" } } }, { // name: "homeAddress" , // value: { street: {name: "street" , value: "1234 x street" }, city: {name: "city" , value: "san francisco" }, zip: {name: "zip" , value: "94107" } } }

Now, if we decide to set CompareSubcolumnsWith and CompareWith to UTF8Type, we get the following result:

// { name: "homeAddress" , value: { city: {name: "city" , value: "san francisco" }, street: {name: "street" , value: "1234 x street" }, zip: {name: "zip" , value: "94107" } } }, { name: "workAddress" , value: { city: {name: "city" , value: "san francisco" }, street: {name: "street" , value: "1234 x street" }, zip: {name: "zip" , value: "94107" } } }

I want to note that in the last example, CompareSubcolumnsWith and CompareWith are both set to UTF8Type, but this is not necessary. You can combine the values of the CompareSubcolumnsWith and CompareWith parameters as you like.

And the last thing I want to mention in connection with sorting is that you can write your own class to do the sorting. The sorting mechanism connects independently ... you can set CompareSubcolumnsWith and / or CompareWith any suitable class name as soon as this class implements the org.apache.cassandra.db.marshal.IType interface (that is, you can create your own comparison scheme for sorting) .

Schematic example

Okay, now we have all the pieces of the puzzle, so let's put them together and make a simple blog app. We will model the application with the following specifications:

single blog support
there may be several authors
records contain title, body, unique label and publication date.
records can be associated with any number of tags
people can leave comments, but cannot register: they re-enter information about themselves each time (just simplify)
comments contain the text, the time when they were left, and the name of the commentator
it should be possible to show all posts in the reverse order of chronological order (the newest is above)
should be able to show all posts by tag in reverse order of chronological

Each of the following sections will describe the column family, which we will define in the key space of our application, show the definition in xml, say why we chose one or another sorting option (s), and also show the data of the column family as JSON.

Authors column family

Modeling the authors' column family is fairly basic; we will not do anything cool here. We will assign each author a string and a key and this will be the full name of the author. Each column in the row will represent a specific parameter of the author's profile.

This is an example of using strings as objects ... in this case, Author objects. With this approach, each column will serve as an object property. Very simple. I want to note that since there is no definition of how the columns should be represented in a row, we can define this definition ourselves.

We will retrieve rows from our column family using the key and select all columns for each row (that is, for example, we will not select only the first 3 columns from the row with the key “foo”). This means that for us it does not matter how the columns will be sorted, so we will use the BytesType sort, because it does not require any validation for the column names.

 < ColumnFamily CompareWith ="BytesType" Name ="Authors" />

BlogEntries Column Family

Again, the column family will behave like a simple key / value store. We will store one record in one line. Columns in the rows will serve as the recording parameters: title, body, etc. (as in the previous example). With a small optimization, we denormalize the tags in one column as a string, separated by commas. In the output, we will split the value of the column to get a list of tags.

The key of each line is a unique label (slug). So, to select a single record, we will look for it by this tag.

< ColumnFamily CompareWith ="BytesType" Name ="BlogEntries" />

TaggedPosts Column Family

So finally there will be something interesting. This column family will show us a new level. It will be responsible for storing links between tags and posts. It will not only store links, but also allow us to select all the blog entries by a specific tag, in a sorted order (remember everything we know about sorting?).

The peculiarity of the solution that I want to point out is that the logic of our application should attach the tag "__notag__" to each BlogEntry entry (I just invented it). This tag will allow us to use this family of columns as well for storing a list of all blog entries in a sorted form. This is a small trick that will make it possible to use only one column family for two selections: “show all recent posts” and “show all recent posts with the tag` foo` ”.

According to this data model, records with three tags will correspond to 1 column in 4 lines. One for each tag and one for the "__notag__" service tag.

Since we decided that we would display the list of records in chronological order, we need to make the column names of the type TimeUUID and set the CompareWith parameter to TimeUUIDType. This will sort the columns by time. So using queries like “get the last 10 entries tagged with` foo` ”will be a very efficient operation.

Now, when we want to display the last 10 entries (on the main, for example), we will need:

take the last 10 columns by the key "__notag__" (tag "all posts")
cycle through this set of columns
in the loop, we know that the value of each column is the key of the row in the BlogEntries column family
so we use this key to get the row for this entry from the BlogEntries column family. so we get all the data about the record
one of the columns in the BlogEntries row is named author and its value is the key in the Authors column family, and we use it to get the author’s profile data
So, we have data of a post and data of the author
next we break down the tagged column to get the list of tags
now we have everything to display this post (for now, without comments, this is a page of the list of posts, and not a specific post)

We can do this procedure using any tag ... it works for both "all records" and "records with the` foo` tag. " Seems not bad.

< ColumnFamily CompareWith ="TimeUUIDType" Name ="TaggedPosts" />

Column family Comments

The last thing we have to figure out is how to model comments. And here, finally, we need supercolumns.

We will have 1 line per post. As keys, we will use the same keys that were used for posts. In the lines we will have supercolumns, for each comment is your own. The names of supercolumns will be unique identifiers of the TimeUUIDType type. So we guarantee that all comments to the post are sorted in chronological order. Columns in each supercolumn will be comment parameters (commentator name, comment time, etc.)

So, it's pretty simple so far ... nothing extraordinary.

< ColumnFamily CompareWith ="TimeUUIDType" ColumnType ="Super" CompareSubcolumnsWith ="BytesType" Name ="Comments" />

Woot!

It's all. Our little blog app is modeled and ready to go. Get a little digestion, and you will end up with a very small piece of XML in your storage-conf.xml:

< Keyspace Name ="BloggyAppy" >   < ColumnFamily CompareWith ="BytesType" Name ="Authors" /> < ColumnFamily CompareWith ="BytesType" Name ="BlogEntries" /> < ColumnFamily CompareWith ="TimeUUIDType" Name ="TaggedPosts" /> < ColumnFamily CompareWith ="TimeUUIDType" Name ="Comments" CompareSubcolumnsWith ="BytesType" ColumnType ="Super" /> </ Keyspace >

Now all you have to do is figure out how to write and read data from Cassandra. This can be done using the Thrift Interface . The Cassandra API wiki page did a decent job of explaining how to work with it, so I won't go into all these details. But, in general, you can simply compile the cassandra.thrift file and use the generated code to access the API. You can also take advantage of a Ruby client or a Python client .

Okay ... I hope all this has made you feel what a hell of a supercolumn it really is and you will start creating cool applications.

From the translator: I tried to translate as close as possible to the original. I hope for constructive criticism.

Update : thanks to Honeyman for the fix and link

08/24/2010: Transferred to the NoSQL blog

Source: https://habr.com/ru/post/100075/

All Articles