Thursday, 8 December 2011

NOTE: Hash Tables, An Introduction

In my recent series of articles on testing I used the hash table to provide a means to write/update a table from within a DATA step without demanding changes to the DATA statement, etc. I've had some very kind feedback on the series, plus some requests to explain what the hash table does and how it works. So here goes...

The hash table was introduced to the DATA step as a means of performing fast, in-memory lookups. The fact that the hash table is a "component object" means that there's a steeper than usual learning curve for this feature. The SAS Language Reference: Concepts manual provides an introductory section on the hash table, but for many DATA step coders the hash table remains unused. A better route to getting started with hash tables is the "Getting Started with the DATA Step Hash Object" paper in the DATA Step sub-section of R&D's SAS/BASE section of the SAS Support web site (follow the link to Component Objects and Dot Notation).

In a nutshell, the hash table is a lookup table that's stored (temporarily) in memory and allows you to search for values within it and thereby get associated values returned. Let's introduce ourselves to the hash table by taking a two step approach: firstly we'll create the hash table, secondly we'll use it for lookups against each of the rows in our input table. Our DATA Step will look like this:

data result;
  set input;
  if _n_ eq 1 then
  do; /* It's first time through, so let's create hash table */

    <create the hash table>
  end;
  /* For each row in input table, do a lookup in the hash table */

  <do a lookup>
run;


Let's make ourselves some test data and assume it contains the sales from our small car sales operation last week:

data SalesThisWeek;
  length make $13 model $40;
  infile cards dsd;
  input make model;
  cards;
Jaguar,XJR 4dr
Land Rover,Discovery SE
Jaguar,XKR convertible 2dr
;
run;


We have a price list for all of the cars we sell; it's in sashelp.cars and contains variables named make, model and invoice. Frustratingly, the MODEL column contains leading blanks, so we use a quick DATA Step to get rid of them, thereby creating work.cars.

data work.cars; set sashelp.cars; model=left(model); run;

We want to load the price list into a hash table, then lookup each of our sold cars to find its invoice value. Here's the code to <create the hash table>:

  DECLARE HASH notehash (DATASET:'work.cars');
  notehash.DEFINEKEY('make','model');
  notehash.DEFINEDATA('invoice');
  notehash.DEFINEDONE();


Woh! That code looks like no other SAS code we've ever seen!! That's because the hash table is a "component object" and the syntax for dealing with components objects differs from mainstream DATA Step syntax. It's called "dot notation". It quickly makes sense once you get over the fact that it's different.

The first line tells SAS that we want to create a new hash table. Hash tables only exist temporarily in memory for the duration of the DATA Step. We use the DECLARE statement to begin to create a new component object; the first parameter (HASH) says what kind of component object we want to create; the second parameter (notehash) is an arbitrary name that we have chosen for the hash table; within the brackets we have told SAS that we're going to use some of the columns of the work.cars table as our lookup table.

The following two lines tell SAS a bit more about how we'd like to load and use the hash table; the fourth line (with DEFINEDONE) tells SAS we've nothing more to tell it about the hash table.

When we use dot notation we type i) the name of a component object, ii) an action we want to perform on the object, and optionally iii) parameters for the action. Parts (i) and (ii) are separated by dots, and the parameters (iii) are enclosed in brackets.

When we create a hash table, we have to declare it, then we have to specify a) the key column(s), i.e. the column(s) that we'll use to find something in the hash table, and b) the data column(s), i.e. the column(s) whose values will be returned once the key values are found in the hash table. In our case, MAKE and MODEL are our key columns, and INVOICE is our data column.

After specifying our key and data columns (with the DEFINEKEY and DEFINEDATA actions) we tell SAS that we're done by performing the DEFINEDONE action on the hash table.

The dot notation is different to what we're used to, but it's not too tricky to get your head around.

Now that we've created our hash table in memory, for use during the DATA Step, all we need to do now is use it. We lookup things in the table by performing the FIND action on the hash table. If SAS finds the key value(s) in the hash table, it will automatically put the associated data value(s) into the data variable(s) in the DATA Step. So, in our case, we need a variable in the DATA Step named INVOICE. If we don't create that variable prior to using the hash table we'll get an error.

When we do a FIND, we're given a return code value that tells us whether SAS found the key(s) in the hash table. A return code value of zero tells us that all is well and the value was found; any other value tells us that SAS did not find the value. So, our code to <do a lookup> will look like this:

  length invoice 8;
  rc = notehash.FIND();
  if rc ne 0 then
    put "Oh dear, we sold something we can't identify";


Notice that there's no direct reference to INVOICE when we do the find. The fact that FIND will put a value into INVOICE is implicit from our preceding DEFINEDATA.

Put all together, our code looks like this:

/* Create our test data */
data SalesThisWeek;
  length make $13 model $40;
  infile cards dsd;
  input make model;
  put make $quote22. model $quote50.;
cards;
Jaguar,XJR 4dr
Land Rover,Discovery SE
Jaguar,XKR convertible 2dr
;
run;

/* Strip leading blanks from MODEL */
data work.cars; set sashelp.cars; model=left(model); run;

/* Add invoice values to sales by using lookup */
data result;
  set SalesThisWeek;
  keep make model invoice;
  if _n_ eq 1 then
  do; /* It's first time through, so let's create hash table */
    DECLARE HASH notehash (dataset:'work.cars');
notehash.DEFINEKEY('make','model');
notehash.DEFINEDATA('invoice');
notehash.DEFINEDONE();
  end;
  /* For each row in input table, do a lookup in the hash table */
  length invoice 8;
  rc = notehash.FIND();
  if rc ne 0 then
    put "Oh dear, we sold something we can't identify";
run;


Once you've got the basic hang of hash tables, the two best sources of reference information are:

a) The hash table tip sheet, available from R&D's SAS/BASE section of the SAS Support web site (see the link to the tip sheet over on the right hand side of the page)

b) Component Objects: Reference in the SAS Programmer's Bookshelf

There are many ways to perform lookups in SAS. Some examples would be i) formats, ii) the KEY= parameter of the SET statement, iii) table joins. The hash table is another option which can offer advantages in many cases. Have fun...