Wednesday, 22 June 2011

NOTE: SYSTASK With An Unknown Number of Calls

In an earlier article (and the associated article on security) I extolled the virtues of SYSTASK for doing operating system activities in parallel. I gave an example that executed two gzip commands in parallel. But what would you do if you didn't know how many files you needed to zip?

Well, let's assume you have a table containing a list of files (WORK.FILES in the example below). We need to issue a SYSTASK statement for each row in the table, and then a single WAITFOR statement that names every one of those tasks, so that we don't proceed any further until all of the zips are complete.

data files;
  file='Alpha.csv'; output;
  file='Beta.csv'; output;
  file='Gamma.csv'; output;
run;

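%macro zippem(data=,var=);
  data _null_;
    set &data end=finish nobs=numobs;
    /* $256 holds a WAITFOR naming roughly 26 tasks; lengthen it for bigger lists */
    length stmt $256;
    /* One asynchronous SYSTASK per row, named TSK00001, TSK00002, ... */
    stmt = cat('systask command "gzip '
              ,&var
              ,'" nowait taskname=TSK'
              ,putn(_n_,'Z5.')
              ,';'
              );
    call execute(stmt);
    /* On the last row, build one WAITFOR that names every task */
    if finish then
    do;
      stmt = 'waitfor _all_';
      do i = 1 to numobs;
        stmt = cat(trim(stmt),' TSK',putn(i,'Z5.'));
      end;
      stmt = cat(trim(stmt),';');
      call execute(stmt);
    end;
  run;
%mend zippem;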

%zippem(data=files,var=file);


The macro produces the following log output:

NOTE: CALL EXECUTE generated line.
1 + systask command "gzip Alpha.csv" nowait taskname=TSK00001;
2 + systask command "gzip Beta.csv " nowait taskname=TSK00002;
NOTE: LOG/Output from task "TSK00001"
> gzip: Alpha.csv: No such file or directory
NOTE: End of LOG/Output from task "TSK00001"
3 + systask command "gzip Gamma.csv" nowait taskname=TSK00003;
4 + waitfor _all_ TSK00001 TSK00002 TSK00003;
NOTE: LOG/Output from task "TSK00003"
> gzip: Gamma.csv: No such file or directory
NOTE: End of LOG/Output from task "TSK00003"
NOTE: LOG/Output from task "TSK00002"
> gzip: Beta.csv: No such file or directory
NOTE: End of LOG/Output from task "TSK00002"


Ignoring the fact that my files don't exist(!), you can see that the output from each command is echoed to the log, which is useful. It's a simple macro, but when the commands are slow and your system has spare processors and I/O capacity, it can cut your elapsed time considerably. You can use the template code shown above for many purposes.
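
One small detail in that log: the Beta.csv command carries a trailing blank inside the quotes, because CAT does not trim its arguments and the FILE variable is padded to nine characters. gzip doesn't mind, but other commands might. Here's a minimal sketch of one way to tidy it, assuming the same WORK.FILES table; it only writes the generated statements to the log with PUT rather than executing them, so it is safe to run anywhere.

```sas
/* Same test data as above */
data files;
  file='Alpha.csv'; output;
  file='Beta.csv'; output;
  file='Gamma.csv'; output;
run;

/* Build the SYSTASK statements, stripping padding from the file name */
data _null_;
  set files;
  length stmt $256;
  stmt = cat('systask command "gzip '
            ,strip(file)              /* remove leading/trailing blanks */
            ,'" nowait taskname=TSK'
            ,put(_n_,z5.)             /* zero-padded row number */
            ,';'
            );
  put stmt=;                          /* inspect only; swap for CALL EXECUTE to run */
run;
```

Once the statements look right in the log, replacing the PUT with CALL EXECUTE(stmt) gives the same behaviour as the macro.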

Monday, 20 June 2011

NOTE: SYSTASK Is Great, If You're Allowed To Use It! (XCMD)

In my previous posting I featured the SYSTASK statement as a great means of executing operating system commands in parallel. Statements such as SYSTASK and CALL SYSTEM allow any operating system command to be executed and so they can be dangerous in the wrong hands. Paul Homes recently wrote an excellent blog post about the whole subject of issuing operating system commands from SAS and the restrictions that can be placed upon doing so. Recommended.

NOTE: With SYSTASK, Even Men Can Multi-Task!

I've been doing a lot of file manipulation recently (hence my observations on INFILE's FILEVAR). I've become a great fan of SYSTASK for executing operating system commands. The key element to SYSTASK's capabilities is that it can execute commands in parallel, i.e. asynchronously. So, if you have a number of large files that you want to do time-consuming tasks upon (such as compress or perform a word count), SYSTASK can do them in parallel and you'll get your results quicker (if your system has multiple processors and/or cores, and decent I/O performance).

Here's a simple (unix) example that zips two files in parallel:

systask command "gzip /user/home/andy/alpha.csv" nowait taskname=alpha;

systask command "gzip /user/home/andy/beta.csv" nowait taskname=beta;

waitfor _all_ alpha beta;

%put Both files are now zipped;


Note the NOWAIT keyword on each SYSTASK statement; this instructs SAS to continue execution rather than waiting for the command to finish. The WAITFOR statement (as its name implies) forms a synchronisation point in your code. In the example above, it will wait for "all" of the tasks named on the WAITFOR statement before allowing execution to continue beyond the WAITFOR statement.
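
If you also want to know whether each command succeeded, or you don't want to wait forever, a variation along the following lines may help. It's a sketch rather than a recipe: it assumes the STATUS= option of SYSTASK COMMAND and the TIMEOUT= option of WAITFOR behave in your release as documented, and the macro variable names (rc_alpha, rc_beta) and the 600 second limit are invented for illustration.

```sas
/* STATUS= names a macro variable that receives the task's status */
systask command "gzip /user/home/andy/alpha.csv" nowait taskname=alpha status=rc_alpha;
systask command "gzip /user/home/andy/beta.csv" nowait taskname=beta status=rc_beta;

/* Wait for both tasks, but stop waiting after 600 seconds */
waitfor _all_ alpha beta timeout=600;

/* SYSRC should be non-zero if the WAITFOR timed out */
%put NOTE: WAITFOR return code is &sysrc;
%put NOTE: alpha status=&rc_alpha, beta status=&rc_beta;
```

If the timeout expires before a task finishes, its status macro variable may not have been set yet, so check &SYSRC before trusting the individual status values.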

In SAS 9.1 there's a restriction whereby you cannot use a tilde (~) or a wildcard (*) in the command. Aside from that, SYSTASK is a terrific means of speeding up your SAS code and making greater use of your computing resources.

Monday, 6 June 2011

NOTE: Reading Multiple Files (with irregular names)

I was introduced to the INFILE statement's FILEVAR parameter recently. It seems it's a great way to read multiple files into a DATA step. Hitherto I had tended to use a wildcard in the FILEREF.

To read multiple files with similar names, you can simply put a wildcard in the FILENAME statement thus:

filename demo '~ratcliab/root*.txt';

If I have files with the following names in my home directory:

root1.txt
root2.sas
root3.txt


...then the first and third will be read by the following DATA step:

17 filename demo '~ratcliab/root*.txt';
18
19 data;
20   length string $256;
21   infile demo;
22   input string $256.;
23 run;

NOTE: The infile DEMO is:
File Name=/home/ratcliab/root1.txt,
File List=/home/ratcliab/root*.txt,
Access Permission=rw-r--r--,
File Size (bytes)=10

NOTE: The infile DEMO is:
File Name=/home/ratcliab/root3.txt,
File List=/home/ratcliab/root*.txt,
Access Permission=rw-r--r--,
File Size (bytes)=10

NOTE: 1 record was read from the infile DEMO.
The minimum record length was 9.
The maximum record length was 9.
NOTE: 1 record was read from the infile DEMO.
The minimum record length was 9.
The maximum record length was 9.
NOTE: SAS went to a new line when INPUT statement reached past the end of a line.
NOTE: The data set WORK.DATA1 has 1 observations and 1 variables.


That's all great if your files have similar names. If not, ask the FILEVAR parameter to step forward...
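
As a minimal sketch of the FILEVAR approach: the idea is to hold the file names in a variable and let INFILE switch to a new file each time that variable changes. The WORK.FILELIST table, the two file paths, and the fileref name "dummy" below are all invented for illustration; the fileref is just a placeholder because FILEVAR supplies the real file name.

```sas
/* Hypothetical list of irregularly named files to read */
data filelist;
  length fname $256;
  input fname $;
  datalines;
/home/ratcliab/alpha.txt
/home/ratcliab/omega.dat
;
run;

data combined;
  set filelist;                       /* one row per file to read */
  length string $256;
  /* FILEVAR= opens the file named in FNAME; END= flags its last record */
  infile dummy filevar=fname end=done truncover;
  do while (not done);
    input string $256.;
    output;                           /* one observation per record */
  end;
run;
```

The FILEVAR variable itself is not written to the output data set, so copy it to another variable inside the loop if you want to record which file each row came from.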

NOTE: More on LENGTH Functions

I got a good amount of feedback on my recent article on LENGTH functions, including a blog comment from Rick@SAS. In addition to providing some useful detail on LENGTH functions within SAS/IML, Rick also suggested:
I think a source of confusion with character missing values is that it doesn't matter how many blanks are in a string (0, 1, 2, or 20), the strings are all equivalent and all equal to the character missing value. That's why you can say
IF x = "" THEN...
and the statement works regardless of the length of x
Good point, Rick. I still recommend using the MISSING function, not least for the clarity of purpose, but Rick's use of the character constant ("") certainly removes the uncertainty around the LENGTH function.
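
Here's a small sketch of Rick's point, using two made-up variables of different lengths. SAS pads the shorter operand with blanks before comparing character values, so all of the tests below are true.

```sas
data _null_;
  length a $1 b $20;
  a = '';                             /* stored as a single blank */
  b = '          ';                   /* stored as twenty blanks */
  if a = b    then put 'a and b compare equal';
  if b = ""   then put 'b equals the character missing value';
  if missing(b) then put 'MISSING(b) is true';
run;
```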

NOTE: Early Bird Discount Ending Soon (SAS Professionals 2011)

The 2011 SAS Professionals Convention is to be held July 12th - 14th in Marlow. Book before June 13th at the reduced rate of £100+VAT. If you plan to go (why wouldn't you?), don't miss the early bird discount.

Sadly, for the third year running, events outside of my control mean that I won't be going. Every year I start off determined to go, and then something crops up in the last couple of months to prevent it. Bah!

Wednesday, 1 June 2011

NOTE: Length Functions (Something Missing?)

How many functions to tell you the length of a value do you need? At least six apparently! SAS provides LENGTH, LENGTHC, LENGTHM, LENGTHN, KLENGTH and %LENGTH. Why?...

As we've all discovered to our cost, the basic LENGTH function accurately tells us the length of a character string (excluding trailing blanks) unless the string is completely blank, in which case LENGTH misleadingly returns the value 1. That's why I always use LENGTHN; it returns the value zero for a blank string.

I rarely use the others but, for the record, LENGTHC returns the length of a string including trailing blanks; but beware because it returns the value one when supplied with a null string as input.

The LENGTHM function is a slightly different beast because it returns the declared length of the variable rather than of its contents, i.e. it returns what was specified on (or implied for) the variable's LENGTH statement. KLENGTH is another oddity. In essence, it is the DBCS equivalent of LENGTH. And %LENGTH is the macro equivalent of LENGTHN, i.e. it returns zero for a null/blank string.

Oh, there's a %KLENGTH too. And SAS/IML has a length function too, but let's not go there!
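
To see the differences side by side, here's a short sketch using a made-up ten-character variable. The values in the comments are what I'd expect in a DATA step.

```sas
data _null_;
  length s $10;

  s = ' ';                            /* completely blank */
  len  = length(s);                   /* 1: the misleading one */
  lenn = lengthn(s);                  /* 0 */
  lenc = lengthc(s);                  /* 10: includes trailing blanks */
  lenm = lengthm(s);                  /* 10: the declared length */
  put len= lenn= lenc= lenm=;

  s = 'abc';
  len  = length(s);                   /* 3 */
  lenn = lengthn(s);                  /* 3 */
  lenc = lengthc(s);                  /* 10 */
  lenm = lengthm(s);                  /* 10 */
  put len= lenn= lenc= lenm=;
run;

/* The macro equivalent returns zero for a null value */
%let empty=;
%put NOTE: %nrstr(%length) of a null macro variable is %length(&empty);
```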

Why might we be using length functions? One popular use is to test if a variable is missing or null. For these cases, the MISSING or NMISS functions are often the best option - not least because their names make the purpose of their usage far clearer than using a length function.

The MISSING function returns 1 if the value passed to it is missing. The value passed to it can be numeric or character. A character string is deemed to be missing if it is all blank or has zero length. Perfect! This is a far better choice than any of the length functions if you want to test a variable for a missing value.

NMISS returns the number of missing numeric values.

Finally, for completeness, I should mention CALL MISSING. You can use this routine to set character or numeric values to missing, though very few of us do.
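
A brief sketch pulling the three together, with variable names invented for the purpose:

```sas
data _null_;
  length c $8;
  c = ' ';
  n = .;

  /* MISSING accepts character or numeric arguments */
  if missing(c) then put 'c is missing';
  if missing(n) then put 'n is missing';

  /* NMISS counts the missing values among its numeric arguments */
  x = 1; y = .; z = .;
  count = nmiss(x, y, z);             /* 2 */
  put count=;

  /* CALL MISSING resets a mixture of character and numeric variables */
  c = 'filled'; n = 42;
  call missing(c, n);
  put c= n=;
run;
```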