Friday, July 19, 2013

How to perform ingestion and querying in a NoSQL enterprise-level environment

I will talk about what I have seen in recent projects (i.e. this may not apply two years from now). There is often a need to replace a pure Oracle instance (expensive, scale-up architecture) with a NoSQL solution (much cheaper, scales out). The problem is replacing the underlying functionality to appropriately:

-query the data
-ingest the data

 Tools

 Datastax seems to be the strongest contender at the moment (July 2013), offering NoSQL, Hadoop and Solr functionality together. Cloudera and MapR are following closely, also with search functionality embedded. Both Datastax and Cloudera Search have tight integration with HDFS: they take care of replication and sharding transparently by reusing the pre-existing HDFS replication and sharding, and they use SolrCloud for this. In the case of Cloudera Search, however, you need to install ZooKeeper to coordinate the Solr nodes (ZooKeeper being standard in an HBase installation). Datastax also lets you talk to Solr, but its model scales around the data model and architecture of Cassandra. Hadoop is available there too, as a convenience (and with second-tier performance compared to Cloudera/MapR).
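For what it's worth, talking to such a SolrCloud-coordinated cluster from client code goes through ZooKeeper. A minimal SolrJ sketch (Solr 4.x era API; the ZooKeeper hosts and the 'orders' collection name are placeholders, not anything from a real setup):

    import org.apache.solr.client.solrj.impl.CloudSolrServer;

    public class CloudConnectExample {
        public static void main(String[] args) throws Exception {
            // SolrCloud-aware client: it reads the cluster state from ZooKeeper
            // and routes requests to the right shards/replicas.
            CloudSolrServer solr = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
            solr.setDefaultCollection("orders"); // hypothetical collection name
            solr.connect();
        }
    }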

  Querying

 So, coming back to our problem, it seems like Solr is more or less replacing SQL in these types of architectures. Secondary indexes in HBase or Cassandra are unfortunately of little use when you have a fair number of columns holding mostly unique data. Solr allows you to do a fair amount of querying, up to simple aggregations via its Stats component, and customizations of the Stats handler are also pretty common. Unfortunately, more complex queries, like SQL's GROUP BY ... HAVING clauses, start to break the model down and are not feasible in Solr.
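To make that concrete, here is a minimal SolrJ sketch of the kind of 'SQL-ish' aggregation the Stats component can replace (Solr 4.x era API; the core URL and the price/status fields are made-up examples):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.FieldStatsInfo;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class StatsQueryExample {
        public static void main(String[] args) throws Exception {
            // Single-node HTTP connection to a hypothetical 'orders' core.
            HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/orders");

            // Roughly: SELECT min(price), max(price), sum(price), avg(price)
            //          FROM orders WHERE status = 'SHIPPED'
            SolrQuery query = new SolrQuery("*:*");
            query.addFilterQuery("status:SHIPPED"); // filter query: cached, not scored
            query.setRows(0);                       // we only want the aggregates, not the docs
            query.set("stats", true);               // enable the Stats component
            query.set("stats.field", "price");      // numeric field to aggregate on

            QueryResponse rsp = solr.query(query);
            FieldStatsInfo stats = rsp.getFieldStatsInfo().get("price");
            System.out.println("min=" + stats.getMin() + " max=" + stats.getMax()
                    + " sum=" + stats.getSum() + " mean=" + stats.getMean());
        }
    }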

  Ingesting

 You would think that Cassandra, being so fast at writes, would be the best way to perform the loading/ingestion of the data in a Datastax environment. Unfortunately, after some extensive testing, that appears not to be the case: you're much better off loading the data via Solr, using SolrJ, than going through the popular Cassandra clients (Hector, Astyanax, Thrift), which prove too slow as an ingestion API.
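To give an idea of what that looks like, here is a minimal, hedged SolrJ ingestion sketch (Solr 4.x era API again; the 'orders' core and its fields are made up). The main points are batching the documents and committing once at the end instead of after every add:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class IngestExample {
        public static void main(String[] args) throws Exception {
            HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/orders");

            List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
            for (int i = 0; i < 100000; i++) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", Integer.toString(i));
                doc.addField("status", "SHIPPED");
                doc.addField("price", 9.99 + i);
                batch.add(doc);

                if (batch.size() == 1000) { // send in batches, not one document at a time
                    solr.add(batch);
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                solr.add(batch);
            }
            solr.commit();                  // a single commit at the end, not per batch
        }
    }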

So Solr seems to be your best bet for this.
There are a number of gotchas to be aware of:

-Use Trie datatypes (TrieIntField, TrieLongField, TrieDateField, ...) as opposed to the 'normal' ones; range queries are much faster on them
-SolrJ's API is poorly documented: e.g. how to get more than the 10 docs returned by default (setRows), how to set the different querying options, etc.
-A lot of 'search-centric' options in Solr don't really apply when you're just looking for a replacement for SQL. Things that come to mind: using field queries instead of plain queries (so as not to eat up the heap), and faceting.
-Solr gives you a nice replacement for populating drop-downs when thousands of choices are available, with its autocomplete-style features (a la Google); see the sketch below this list.
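On that last point, one hedged way to approximate an autocomplete lookup from SolrJ is the Terms component. A minimal sketch, assuming the /terms handler is enabled in solrconfig.xml; the 'city' field and the typed prefix are made up:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.client.solrj.response.TermsResponse;

    public class AutocompleteExample {
        public static void main(String[] args) throws Exception {
            HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/orders");

            SolrQuery query = new SolrQuery();
            query.setRequestHandler("/terms");   // route to the Terms request handler
            query.setTerms(true);
            query.addTermsField("city");         // hypothetical field backing the drop-down
            query.setTermsPrefix("new");         // what the user has typed so far
            query.setTermsLimit(10);             // only return the first few suggestions

            QueryResponse rsp = solr.query(query);
            for (TermsResponse.Term t : rsp.getTermsResponse().getTerms("city")) {
                System.out.println(t.getTerm() + " (" + t.getFrequency() + ")");
            }
        }
    }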