Getting Started with Phoenix

HDInsight HBase clusters have added Phoenix support.  Phoenix adds support for SQL queries on top of an HBase cluster.  It does this by compiling your SQL query into a series of table scans and returning a regular JDBC result set.

What is Phoenix?

Apache Phoenix originated at Salesforce.com as an internal project to make it easier to work with big data systems, in particular HBase, a NoSQL database in the Hadoop ecosystem.  Phoenix enables OLTP and analytics for low-latency applications by combining standard SQL and JDBC APIs and full ACID transaction capabilities with the schema-on-read, late-bound capabilities of the NoSQL world.

The Phoenix framework provides both client and server libraries.  On the server, Phoenix provides custom HBase coprocessors for handling indexing, joins, transactions, and schema management, all features that HBase doesn't provide on its own.  On the client side, Phoenix provides a library that handles parsing and query plan selection before interacting with the HBase API, converting SQL into SCAN, PUT, and DELETE operations that execute server side on the Phoenix coprocessors.

Phoenix is widely supported across a number of Hadoop distribution platforms, including Hortonworks, MapR, and Cloudera.

How to Connect

Phoenix supports connecting with a JDBC driver.  To connect to an HDInsight HBase cluster using Phoenix, create a connection as follows:
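A minimal Java sketch of opening a Phoenix connection (this assumes the Phoenix client JAR is on the classpath and a live cluster is reachable; the query against SYSTEM.CATALOG is just illustrative):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class PhoenixConnect {
    public static void main(String[] args) throws Exception {
        // The bare jdbc:phoenix: URL lets the driver discover the
        // ZooKeeper quorum from hbase-site.xml on the classpath.
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:");
             Statement stmt = conn.createStatement();
             // SYSTEM.CATALOG exists on any Phoenix-enabled cluster.
             ResultSet rs = stmt.executeQuery(
                     "SELECT TABLE_NAME FROM SYSTEM.CATALOG LIMIT 10")) {
            while (rs.next()) {
                System.out.println(rs.getString("TABLE_NAME"));
            }
        }
    }
}
```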

The connection string of jdbc:phoenix: is all that is needed.  The ZooKeeper nodes will be pulled from the hbase.zookeeper.quorum property in the hbase-site.xml file if present.  You can, however, directly specify your ZooKeeper nodes in the connection string if needed.  For example:
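With hypothetical ZooKeeper host names and the default client port, such a connection string might look like:

```
jdbc:phoenix:zk0.example.com,zk1.example.com,zk2.example.com:2181:/hbase
```

The host list, port, and root znode are each optional; anything omitted falls back to the values in hbase-site.xml.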

Creating Indexes

Phoenix provides the capability to create secondary indexes on top of HBase.  Secondary indexes, unlike primary indexes, may contain duplicate values.  HBase does not natively support secondary indexes, leaving just the row key available for scanning.

Secondary indexes can be created on both tables and views.  Secondary indexes are kept up to date automatically as data in the table changes.  Phoenix supports different types of indexes: covered, functional, global, and local.

Global indexes are great for read-heavy use cases.  The performance hit for managing the index is taken at write time (during UPSERT or DELETE).  Phoenix intercepts data table updates at write time and updates all index tables accordingly.  At read time, Phoenix selects the index table that produces the fastest query.
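As a sketch, a global index looks like this in Phoenix SQL (the table and column names here are hypothetical; INCLUDE adds covered columns so a query can be answered from the index alone):

```sql
-- CONTACT_INFO becomes the index key; INCOME is carried
-- along as a covered column in the index table.
CREATE INDEX my_index ON my_table (contact_info) INCLUDE (income);
```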

Local indexes are better for write-heavy use cases.  All local indexes of a table are stored in shadow column families in the same data table.  Because of this, local indexes keep index data on the same server as the table data, avoiding network overhead during writes.  This comes at the cost of some overhead at read time, since every region must be examined: the exact region location of an index entry is not known in advance.
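A local index uses the same DDL with the LOCAL keyword (again with hypothetical names):

```sql
-- Index data is stored in shadow column families of my_table itself.
CREATE LOCAL INDEX my_local_index ON my_table (contact_info);
```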

Phoenix with SQLLine

HDInsight includes a helpful utility called SQLLine, a simple shell for executing SQL commands against a database.  It is a great tool for exploring and experimenting with Phoenix.  To access SQLLine:

  1. ssh into an HDInsight HBase cluster.
  2. cd /usr/hdp/2.2.9.1-7/phoenix/bin
  3. ./sqlline.py

After connecting with the SQLLine client, you can execute commands against the Phoenix database.
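For example, a small Phoenix SQL session might look like this (the table and values are hypothetical; note that Phoenix uses UPSERT rather than INSERT):

```sql
CREATE TABLE IF NOT EXISTS visits (
    host VARCHAR NOT NULL PRIMARY KEY,
    total BIGINT
);
UPSERT INTO visits VALUES ('node1', 100);
UPSERT INTO visits VALUES ('node2', 250);
SELECT * FROM visits;
```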

Here are some helpful commands:
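A few standard SQLLine shell commands to get oriented (the table name is a placeholder):

```
!tables              -- list the tables visible to Phoenix
!describe MY_TABLE   -- show the columns of a table
!help                -- list all SQLLine commands
!quit                -- exit SQLLine
```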

