There are a lot of new features coming down the pipe with Riak 2.0, but the most important one (at least to me) is Riak Search 2.0.
What is Riak Search 2.0 exactly? Riak is a very simple key-value store that sits on the AP side of the CAP theorem. It scales very well and, thanks to Erlang, it is extremely reliable. It has very few features (and this is a great thing for a data store), which is why introducing Solr as the revamped search backend is kind of a big deal. I have never used Solr before, so watch me fail (or maybe succeed) at indexing some dbpedia documents.
Getting the test data
DBpedia is a query interface to Wikipedia's structured data; you talk to it in SPARQL, an SQL-like language. I am going to query it for all the cities on the planet with a population bigger than 50,000. The query looks like this:
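The original query isn't reproduced here; a SPARQL query along these lines, sent to the public dbpedia endpoint with curl, returns the same kind of result set (the exact ontology terms are my assumption based on the dbpedia vocabulary):

```shell
# Sketch: all cities with a population above 50,000.
# dbo:City and dbo:populationTotal are standard dbpedia ontology terms.
curl -G 'http://dbpedia.org/sparql' \
  -H 'Accept: application/sparql-results+json' \
  --data-urlencode query='
    PREFIX dbo:  <http://dbpedia.org/ontology/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?name ?population WHERE {
      ?city a dbo:City ;
            rdfs:label ?name ;
            dbo:populationTotal ?population .
      FILTER (?population > 50000 && lang(?name) = "en")
    }' > results.json
```

The `Accept` header asks the Virtuoso endpoint behind dbpedia for JSON results instead of the default XML.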
The data needs to be sliced up into individual JSON files so we can load them into Riak easily and have Solr index them using the custom schema. After removing some header information, the data from dbpedia looks like this:
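The sample isn't shown; a SPARQL JSON result set has roughly this shape (the two entries are illustrative, not actual dbpedia output):

```json
{
  "results": {
    "bindings": [
      { "name": { "type": "literal", "xml:lang": "en", "value": "Aachen" },
        "population": { "type": "typed-literal", "value": "245885" } },
      { "name": { "type": "literal", "xml:lang": "en", "value": "Aalborg" },
        "population": { "type": "typed-literal", "value": "122219" } }
    ]
  }
}
```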
You get the idea: it is a nested data structure, an array of smaller hashes. I processed it with Clojure to get one entry per file, using UUIDs to have unique file names (keys).
This produces a bunch of JSON files that I can upload to Riak. Before we get there, let's start up and configure our Riak service.
I am using the Yokozuna release, version 0.14.0. After downloading the source we need to create a devrel with two nodes. I assume you have Erlang and the build tools installed.
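Assuming the standard Riak source-build workflow (the DEVNODES knob exists in the 2.0-era Makefile; older trees may only support a plain `make devrel`):

```shell
# From the unpacked riak-yokozuna source tree:
make                     # compile Riak and its dependencies
make devrel DEVNODES=2   # generate dev/dev1 and dev/dev2
```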
There might be some libs missing, but I don't want to go too deep into the operating-system-specific part of the story. After the dev nodes are created we need to configure Riak and enable search. I prefer LevelDB as the persistent store, and I would like Riak to listen on all available interfaces, which makes our lives easier in a virtualized environment. Let's do all of that.
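With the cuttlefish-style riak.conf that 2.0-era builds use, that amounts to something like the following (the devrel ports are an assumption; check the generated etc/riak.conf for the actual defaults on your build):

```
## dev/dev1/etc/riak.conf -- repeat with dev2's ports for the second node
search = on
storage_backend = leveldb
listener.http.internal     = 0.0.0.0:10018
listener.protobuf.internal = 0.0.0.0:10017
```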
After the configuration is done, start up the nodes, join them into one cluster, and we are almost ready to start shoving data in.
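With a devrel this is the usual stage-and-commit dance:

```shell
# Start both nodes, then stage and commit the cluster changes.
dev/dev1/bin/riak start
dev/dev2/bin/riak start
dev/dev2/bin/riak-admin cluster join dev1@127.0.0.1
dev/dev2/bin/riak-admin cluster plan     # review the staged changes
dev/dev2/bin/riak-admin cluster commit
```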
Checking the member status to verify that the ring is evenly distributed among the nodes:
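With two nodes, each should report ownership of roughly half the ring:

```shell
dev/dev1/bin/riak-admin member-status
```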
Creating the schema and the index

In this section we are going to create a Solr schema and an index so that we can index the documents. Think of the schema as the measure of how much Solr understands the data. It can be configured to reference individual elements in complex nested data structures, and data types can be configured too, which makes range queries possible for numeric fields. I don't want to go too deep into the details of Solr; it is worth spending a few hours on the documentation, I am just scratching the surface in this post.
First things first, we need to create a schema that is used by the index. I am not really a Solr expert, but here is what I came up with:
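The schema itself isn't shown; a minimal one along these lines should work, assuming each stored document has top-level `name` and `population` fields (Yokozuna's JSON extractor flattens nested structures into dot-separated field names, so adjust to your actual documents; the `_yz_*` internals are required by Yokozuna, and the authoritative list is in its default schema):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<schema name="cities" version="1.5">
  <fields>
    <!-- fields extracted from the city documents -->
    <field name="name"       type="string" indexed="true" stored="true"/>
    <field name="population" type="int"    indexed="true" stored="true"/>
    <!-- catch-all for anything not named explicitly -->
    <dynamicField name="*" type="ignored"/>
    <!-- internal fields Yokozuna needs on every document -->
    <field name="_yz_id"   type="_yz_str" indexed="true" stored="true" required="true"/>
    <field name="_yz_ed"   type="_yz_str" indexed="true" stored="true"/>
    <field name="_yz_pn"   type="_yz_str" indexed="true" stored="true"/>
    <field name="_yz_fpn"  type="_yz_str" indexed="true" stored="true"/>
    <field name="_yz_vtag" type="_yz_str" indexed="true" stored="true"/>
    <field name="_yz_rk"   type="_yz_str" indexed="true" stored="true"/>
    <field name="_yz_rb"   type="_yz_str" indexed="true" stored="true"/>
    <field name="_yz_rt"   type="_yz_str" indexed="true" stored="true"/>
    <field name="_yz_err"  type="_yz_str" indexed="true" stored="true"/>
  </fields>
  <uniqueKey>_yz_id</uniqueKey>
  <types>
    <fieldType name="string"  class="solr.StrField" sortMissingLast="true"/>
    <fieldType name="int"     class="solr.TrieIntField" precisionStep="0"
               positionIncrementGap="0"/>
    <fieldType name="_yz_str" class="solr.StrField" sortMissingLast="true"/>
    <fieldType name="ignored" class="solr.StrField" indexed="false"
               stored="false" multiValued="true"/>
  </types>
</schema>
```

The `int` type (Solr's TrieIntField) is what makes the numeric range query on population possible later.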
Upload the schema to Riak and create the index:
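Something like this over the HTTP API (these are the Riak 2.0 paths; earlier Yokozuna previews used `/yz/...` endpoints instead, and the port is the assumed dev1 default):

```shell
# Upload the schema, then create an index that uses it.
curl -XPUT 'http://127.0.0.1:10018/search/schema/cities' \
  -H 'Content-Type: application/xml' \
  --data-binary @cities-schema.xml

curl -XPUT 'http://127.0.0.1:10018/search/index/cities' \
  -H 'Content-Type: application/json' \
  -d '{"schema":"cities"}'
```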
In Riak 2.0 there is a new feature called bucket types that allows groups of buckets to share the same configuration.
Creating a new bucket type called cities and activating it:
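Roughly like this, with the `search_index` property pointing the type at the index we just created; the final loop then loads the sliced-up documents, using each UUID file name as the key:

```shell
dev/dev1/bin/riak-admin bucket-type create cities \
  '{"props":{"search_index":"cities"}}'
dev/dev1/bin/riak-admin bucket-type activate cities

# Load the JSON files; the UUID file name becomes the Riak key.
for f in cities/*.json; do
  key=$(basename "$f" .json)
  curl -XPUT "http://127.0.0.1:10018/types/cities/buckets/cities/keys/$key" \
    -H 'Content-Type: application/json' \
    --data-binary @"$f"
done
```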
The search index can be queried in two ways:
- using the Solr interface
- using Riak
The syntax is the same either way. Let's find the first 100 cities with a population between 51,000 and 52,000, display only the name and the population, and order by population. With Solr's very comprehensive query syntax you end up with something like this:
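A sketch through Riak's search endpoint; the same q/fl/sort/rows parameters work against the Solr HTTP interface directly (the port is the assumed dev1 default):

```shell
curl -G 'http://127.0.0.1:10018/search/query/cities' \
  --data-urlencode 'q=population:[51000 TO 52000]' \
  --data-urlencode 'fl=name,population' \
  --data-urlencode 'sort=population asc' \
  --data-urlencode 'rows=100' \
  --data-urlencode 'wt=json'
```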
I am a huge fan. Riak is my favorite simple key-value store (o hai GET/PUT), and with Solr it becomes really easy to index what is stored in your system. I think it might be a bad idea to run Solr on the same nodes as Riak, since hunting down bugs would be painful, but other than that I am pretty happy with the current state of Yokozuna.
Kudos go to #riak on freenode, especially to @coderoshi and @nikore.