ARCHIVE: W18 SAS Workshop: Creating graphs in SAS

PDF copy of the complete SAS syntax used in this workshop

SAS creates graphs – no WAY!!!  Yes WAY!!  It does a wonderful job and lets you customize many different aspects of a graph.  However, like many things in SAS, there is a bit of a learning curve associated with this part of the program, and it turns out SAS/GRAPH is not available with the University Edition of SAS.  You can still create graphs in the University Edition, just not the entire array of them.

To view the full array of graphs that are available in SAS/GRAPH, please visit the SAS Graphics Gallery.

For the purposes of this workshop, I will discuss ODS graphics and we will create a histogram that can be run in both SAS Studio and PC SAS.  Time permitting, I will showcase a page on the SAS support site to demonstrate the capabilities of SAS/GRAPH – not available in University Edition, but available to those of us running PC SAS or SAS on a server.

ODS Graphics

ODS, as you may recall from previous chats, is the acronym for SAS’ Output Delivery System.  It is our gateway to saving our output in a variety of formats: PDF, RTF, Excel, etc.  ODS is also the engine behind part of SAS graphics.  With many PROCedures, turning on ODS graphics gives you a number of plots and graphics specific to that PROCedure.  These are all available in University Edition – as long as you have access to the PROC.
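
For example, here is a minimal sketch of that output “gateway” in action – routing results to a PDF file.  The file name is only a placeholder, and the PROC is just there to give the destination something to capture:

/* Open a PDF destination, run a PROC, then close the destination */
/* The file name below is only a placeholder – point it wherever you like */
ods pdf file="my_results.pdf";

Proc means data=sashelp.class;
    var height weight;
Run;

ods pdf close;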

To turn on the ODS graphics, simply type
ods graphics on;

It may be on by default, but we can make sure that it is on by running that one line of code.  To turn it off at the end of a specific PROCedure, type:
ods graphics off;

PROCedures that support ODS graphics and are available in University Edition SAS Studio:

ANOVA, BOXPLOT, CALIS, CLUSTER, CORRESP, FACTOR, FMM, FREQ, GAM, GENMOD, GLIMMIX, GLM,
GLMPOWER, GLMSELECT, KDE, KRIGE2D, LIFEREG, LIFETEST, LOESS, LOGISTIC, MCMC, MDS, MI,
MIXED, MULTTEST, NLIN, NPAR1WAY, ORTHOREG, PHREG, PLM, PLS, POWER, PRINCOMP, PRINQUAL,
PROBIT, QUANTREG, REG, ROBUSTREG, RSREG, SEQDESIGN, SEQTEST, SIM2D, SURVEYFREQ,
SURVEYLOGISTIC, SURVEYPHREG, SURVEYREG, TPSPLINE, TRANSREG, TTEST, VARCLUS, VARIOGRAM

Source:  https://support.sas.com/documentation/cdl/en/statug/63962/HTML/default/viewer.htm#statug_odsgraph_sect021.htm

Let’s try one example with and without ODS graphics to see what we get.  You can download a PDF copy of the SAS syntax here or copy the following syntax:

/* Working with a dataset in the SAS Help which contains
blood pressure measurements for males and females.
Let’s read it and save it locally on our own systems */
Data heart;
    set sashelp.heart;
Run;

/* Run a Proc CONTENTS to get a sense of what information
can be found in this dataset */
Proc contents data=heart;
Run;

/* Since ODS graphics may be on by default
Let’s turn it off to see what the Proc TTEST
gives us without graphics */
ods graphics off;

/* Let’s run a TTest to see whether there are differences
between males and females for the diastolic measure of BP */
Proc ttest data=heart;
    class sex;
    var diastolic;
Run;

/* Now let’s turn on the ODS graphics */
ods graphics on;

/* Rerun the Ttest procedure – making no changes
to the code */
Proc ttest data=heart;
    class sex;
    var diastolic;
Run;

Each PROCedure listed above will produce different plots related to the analysis at hand.  For more information on the graphs produced by a particular PROC, please refer to that PROC’s documentation – the source link below the list takes you to the ODS graphics section of the SAS/STAT documentation.
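
Many of these PROCedures also accept a PLOTS= option on the PROC statement, which lets you control which ODS graphs get produced.  A minimal sketch using the TTEST example from above – PLOTS=ALL simply asks for every plot the PROC can make; check the PROC documentation for the names of individual plots:

/* Ask PROC TTEST for every ODS plot it can produce */
ods graphics on;
Proc ttest data=heart plots=all;
    class sex;
    var diastolic;
Run;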

Create a Histogram

We will continue to work with the Heart dataset in the SAS Help directory.  Now we are looking to create a HISTOGRAM from scratch, rather than relying on the ODS graphics option.  The graph we are looking to create will contain superimposed histograms of the diastolic and systolic measures for only the males in this dataset.

Subsetting the data

We have a large dataset with 5209 observations.  For this exercise we would like to create a subset of this dataset that only contains the males.  There are a number of ways to do this.  I will demonstrate 2 different ways.

If statement to subset

Data male_data;
    set heart;
    if sex = "Male";
Run;

We are creating a new dataset called male_data and reading the dataset heart – which we created earlier.  The IF statement says to keep only the observations where the variable called sex has a value of Male.

When you run this piece of code, our new dataset, male_data, now contains only 2336 observations.  To check whether this was successful, you can run a Proc PRINT – but restrict the number of observations shown by adding (obs=xx) at the end of the Proc PRINT statement:

Proc print data=male_data (obs=20);
Run;

OR run a Proc Freq on sex to see whether you have any females in the dataset:

Proc freq data=male_data;
    tables sex;
Run;

Using PROC SQL to subset

If you’ve ever programmed in SQL, you’ll know the merits and advantages of using the SQL language.  To use SQL you should have a good command of your data structure.  In our case, we have a dataset called heart, and we want to create a new dataset called male_data with all observations that have the value “Male” in the variable sex.  In SAS, the SQL PROCedure allows you to use SQL coding.  Here is the complete code – copy and run it, and we’ll work through each line of the code below.

Proc sql;
    create table male_data as
    select * from heart
    where sex = "Male";
Quit;

Notice that the 3 lines after the Proc SQL statement form a single statement – in other words, there is only one semicolon at the end of the 3 lines.  Yes, I could easily have kept all three lines on one line of code, but sometimes it is easier to split them up to see what’s happening.

create table male_data as – creates a new table, or dataset, in SAS and calls it male_data.  This does the same job as the Data male_data; in our previous subsetting example.

select * from heart – select all the observations (and all the variables, which is what the * means) from the dataset heart.  This is similar to our set heart; in the previous example.

where sex = "Male"; – keep only those observations that have a value of Male in the variable called sex.  Similar to our if sex = "Male"; in the previous example.

Run the code and double-check again by either running the Proc print code or the Proc freq code to ensure that our male_data dataset only contains males.

Which way to subset?

Both methods provide you with the same resulting dataset.  Which one you select is really up to you.  A blog post from 2010 written by SAS offers a great analogy that I will link to and repost here.

Using the Data Step is like going grocery shopping and going directly to the aisles where the items you need are located.  You know where you need to go.

Using Proc SQL, is like going grocery shopping, but this time you give your list to an employee and you have no control as to how they are acquiring the items on your list.  Your grocery list will be completed, but you don’t know how it was completed.

Very interesting analogy!  The Data Step is procedural whereas SQL is not.  Having said that, you may be asking, why would anyone use Proc SQL?  For many people it is comfort!  Many SAS programmers learned SQL first and will continue to use it rather than move to the Data Step.  Very much like myself – I learned to code in SAS and have a hard time moving to either Enterprise Guide or SAS Studio.

Either way works – pick the one you prefer!
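
As an aside, there is at least one more variant you will often see for this kind of subsetting – a WHERE statement inside the Data Step.  It is not part of the workshop exercise, just a sketch for reference; it produces the same male_data dataset:

/* Subsetting with a WHERE statement instead of a subsetting IF */
Data male_data;
    set heart;
    where sex = "Male";
Run;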

Creating the Histogram

We will be working with the male_data dataset we just created.  To create the histogram we will use PROC SGPLOT.  Here is the complete code – let’s copy it into our SAS editor, run it, and discuss it line by line below.

Proc sgplot data=male_data;
    histogram diastolic / transparency= 0.7 binwidth=10;
    histogram systolic / transparency= 0.5 binwidth=10;
    yaxis grid;
    xaxis display=(nolabel);
Run;

histogram – these statements tell SAS the type of graph we want in the output window.  With Proc SGPLOT you can create histograms, scatter plots, horizontal bar charts, vertical bar charts, and time series graphs.

In our example we are creating 2 histograms – one for the diastolic measure and a second for the systolic measure.  In both instances we are adding 2 options – one for the transparency of the bars and the second for the width of the bins.  After you run the graph for the first time, go back and change the BINWIDTH to see how the graph changes.

yaxis – adds the y-axis gridlines; a label for the y-axis is displayed by default.

xaxis – since the label of the x-axis is displayed by default, our code asks that it not be displayed.
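
If you would also like a smooth curve drawn over the bars, PROC SGPLOT has a DENSITY statement that can overlay a normal curve or a kernel density estimate on a histogram.  A minimal sketch for the diastolic measure:

/* Histogram of diastolic with a normal curve and a kernel density overlaid */
Proc sgplot data=male_data;
    histogram diastolic / transparency=0.7 binwidth=10;
    density diastolic / type=normal;
    density diastolic / type=kernel;
    yaxis grid;
Run;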

One more plot – Scatter with an Ellipse

Using the same male dataset – let’s create a scatter plot with the 2 blood pressure measures and ask SAS to draw a 95% prediction ellipse.  Copy and run the following SAS syntax:

proc sgplot data=male_data;
    scatter x=diastolic y=systolic;
    ellipse x=diastolic y=systolic;
    keylegend / location=inside position=bottomright;
run;

Scatter with an x= and y= will create a scatter plot with, in our example, the diastolic measure along the x-axis and systolic along the y-axis.

Ellipse will draw a 95% prediction ellipse around our data as specified by the x= and y= variables.  You can also change the confidence level or the type of ellipse with the ALPHA= and TYPE= options – see the sketch below.

keylegend – a statement that places the legend inside the graph, at the bottom right.  Try changing the position to see what happens.
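
As noted above, the ELLIPSE statement accepts an ALPHA= option to change the confidence level and a TYPE= option to draw a mean ellipse instead of a prediction ellipse.  A minimal sketch of a 90% mean ellipse:

/* Scatter plot with a 90% confidence ellipse for the mean */
proc sgplot data=male_data;
    scatter x=diastolic y=systolic;
    ellipse x=diastolic y=systolic / alpha=0.10 type=mean;
    keylegend / location=inside position=bottomright;
run;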

Conclusion

Creating graphs in SAS can be a fun challenge.  If you are using University Edition SAS Studio, there are limits to what you can do, since the SAS/GRAPH package is not available to you.  But ODS graphics are available, along with some of the more basic graphing features.  This post demonstrated the use of ODS graphics and worked through 2 examples using Proc SGPLOT.


Principal Component, Cluster, and Discriminant Analyses

The goal of this workshop and blog post is to review 3 different multivariate analyses.  We will use one common dataset to showcase the different purposes of the analyses and to showcase the different PROCedures available in SAS to conduct each analysis.

The dataset we will be using is the Fisher iris dataset (1936), originally collected by Dr. E. Anderson and used by Dr. Ronald Fisher to derive discriminant analysis.  The dataset contains measures of petal length, petal width, sepal length, and sepal width on 50 plants of each of 3 varieties of iris.  The dataset is available within the SAS Help library.  To access this dataset you will need to use the dataset name sashelp.iris.

Exploratory and Explanatory Analyses

When we think of statistics, most of us tend to think of traditional hypothesis-driven analyses: ANOVAs, regressions, means comparisons, and the list goes on.  These are types of explanatory analyses.  There is another world of statistics, referred to by some as exploratory analyses – analyses that are not driven by a hypothesis.  Exploratory analyses are used more for describing relationships among variables or measures taken during a trial or in a dataset.  Principal Component Analysis (PCA) and Cluster Analysis are two examples of exploratory analyses, whereas discriminant analysis falls into the explanatory analysis bucket.

Principal Component Analysis (PCA)

Please review the PCA blog post for more details regarding this analysis.  This post will not provide the same level of detail but will form the basis of using the same dataset across three different analyses.

The roots of PCA go back to 1901, when it was developed by Karl Pearson.  Its primary role is to reduce the number of variables used to explain a dataset.  Factor analysis (FA) is a related process and has the same goal.  Many people confuse the two and tend to use the terms factor analysis and PCA interchangeably, when the two analyses are similar but not interchangeable.  I’ve listed a few of the primary differences between PCA and factor analysis:

  1. Both analyses begin with a correlation matrix.  PCA keeps the diagonal of the correlation matrix as 1’s, whereas factor analysis replaces the diagonal with estimates of the variance each variable shares with the others (the communalities).
  2. In PCA the total variance among the variables is explained; in FA the common (shared) variance is the basis of the analysis.
  3. PCA is less complex mathematically compared to FA
  4. PCA is one procedure, whereas FA is a family of procedures
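
If you do want a factor analysis rather than a PCA, SAS offers PROC FACTOR.  Here is a minimal sketch for comparison, using the iris measurements we work with below – PRIORS=SMC is what replaces the 1’s on the diagonal with shared-variance estimates, and NFACTORS= and ROTATE= are common additions:

/* Factor analysis of the four iris measurements, for comparison with PROC PRINCOMP */
Proc factor data=sashelp.iris method=principal priors=smc nfactors=2 rotate=varimax;
    var SepalLength PetalLength SepalWidth PetalWidth;
Run;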

SAS code and output using the IRIS dataset

/* For this workshop we will use the IRIS dataset */
/* Fisher’s dataset can be found in the SASHELP */
/* Library. The dataset name is sashelp.iris */
/* SASHELP is the permanent SAS directory */

Proc print data=sashelp.iris;
Run;

/* Let’s get a sense of relationships that may exist */
/* in the dataset. We will use PROC SGPLOT to visualize */

Proc sgplot data=sashelp.iris;
    scatter x=SepalLength y=PetalLength; * / datalabel=species;
Run;

/* We will run the PROC PRINCOMP as we did in the */
/* previous workshop. Options we are using include */
/* plots=all to show all plots available in the PROC */
/* n = 3 – we will start without this option and */
/* then add it back to see only the first 3 components */

ods graphics on;
Proc princomp data=sashelp.Iris standard plots=all n=3;
    var SepalLength PetalLength SepalWidth PetalWidth;
Run;
ods graphics off;

The output explanations will be the same as those reviewed in the last post – only the dataset is different.

Cluster Analysis

Cluster analysis is a multivariate analysis that does not have a hypothesis.  We are interested in seeing whether there are any natural clusters, or groups, in the data.  Clusters can be based on the variables or measures collected in the dataset, OR they can be based on the observations within the dataset – variables or observations.

Clustering techniques will use two processes: distances and linkages.  Being familiar with these terms may help you to select the most appropriate clustering technique for your data.

Distance:  a quantitative index defining the similarity of the clusters remaining in the analysis at each step.

Linkage:  the two clusters with the smallest distance between them, as determined by the chosen distance measure, are linked together to form a new cluster.

Standardizing the variables to be used in a cluster analysis is essential.  Because clustering techniques use some measure of “distance”, making sure that all the variables are on the same scale will give a better clustering.

There are 2 broad types of clustering techniques used in Cluster Analysis:

  1. Hierarchical clustering:  clusters or groups a small to moderate number of cases based on several quantitative attributes.  The groups are clustered on a given set of variables – so when we talk about these clusters, we can only discuss their merits based on the variables used to create the groups.  Remember context!
  2. K-means clustering:  creates clusters from a relatively large number of cases based on a relatively small set of variables.  K-means uses an iterative technique; cases are added to a cluster during the analysis rather than at the end, which allows some cases to shift around before the analysis is complete.  You also need to specify how many clusters you want as a result with K-means clustering.

SAS code and output using the IRIS dataset

In SAS there are 2 PROCedures that are commonly used for Cluster Analysis:

PROC Cluster and PROC Fastclus:

Directly from the SAS Online documentation:
“The CLUSTER procedure is not practical for very large data sets because, with most methods, the CPU time is roughly proportional to the square or cube of the number of observations. The FASTCLUS procedure requires time proportional to the number of observations and can therefore be used with much larger data sets than PROC CLUSTER. If you want to hierarchically cluster a very large data set, you can use PROC FASTCLUS for a preliminary cluster analysis to produce a large number of clusters and then use PROC CLUSTER to hierarchically cluster the preliminary clusters.”
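
Here is a minimal sketch of that two-stage approach.  I am assuming the MEAN= output dataset from PROC FASTCLUS (which carries a _FREQ_ count for each preliminary cluster) is what gets fed into PROC CLUSTER; the cluster counts are arbitrary placeholders:

/* Stage 1: FASTCLUS builds many preliminary clusters and saves their means */
Proc fastclus data=sashelp.iris maxc=20 maxiter=10 mean=prelim_means out=prelim;
    var SepalLength SepalWidth PetalLength PetalWidth;
Run;

/* Stage 2: CLUSTER hierarchically clusters the preliminary cluster means, */
/* weighting each mean by the number of observations it represents */
Proc cluster data=prelim_means method=ward outtree=tree;
    freq _freq_;
    var SepalLength SepalWidth PetalLength PetalWidth;
Run;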

When you create clusters – in any package – it is handy to calculate the means of the clusters and to run them through a frequency analysis – essentially we want to be able to review some descriptive statistics on our new groups or clusters.  PROC Fastclus saves all of this information for us by default or as part of the PROC coding, whereas with PROC Cluster you need to add in a few extra steps.  Part of the coding that is commonly used is a SAS macro with PROC Cluster that runs these descriptive statistics on the output clusters.
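
For example, once the OUT= dataset from PROC Fastclus exists (it carries a CLUSTER variable for each observation), a quick Proc MEANS by cluster gives you those descriptive statistics.  A minimal sketch, using the clus dataset we create just below:

/* Descriptive statistics for each cluster in the OUT= dataset */
Proc means data=clus;
    class cluster;
    var SepalLength SepalWidth PetalLength PetalWidth;
Run;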

However, be assured that the output is the same whether you use PROC Fastclus or PROC Cluster with the macro.  For simplicity, we will only use the PROC Fastclus syntax for our example.

/* Cluster Analysis */
/* Creating 2 clusters, saving the results in a new dataset called CLUS */
/* Try a Proc PRINT to see what is found in the new dataset CLUS */
Proc fastclus data=sashelp.iris maxc=2 maxiter=10 out=clus;
    var SepalLength SepalWidth PetalLength PetalWidth;
Run;

/* Using the resulting dataset to get a feel for who landed in what cluster */
Proc freq data=clus;
    tables cluster*species;
Run;

/* Creating 3 clusters, saving the results in a new dataset called CLUS */
Proc fastclus data=sashelp.iris maxc=3 maxiter=10 out=clus;
    var SepalLength SepalWidth PetalLength PetalWidth;
Run;

/* Using the resulting dataset to get a feel for who landed in what cluster */
Proc freq data=clus;
  tables cluster*Species;
Run;

/* To obtain a graphical presentation of the clusters we need to run the */
/* Proc CANDISC to get the information needed for the graphical output */

Proc candisc data=clus anova out=can;
    class cluster;
    var SepalLength SepalWidth PetalLength PetalWidth;
    title2 'Canonical Discriminant Analysis of Iris Clusters';
Run;

Proc sgplot data=Can;
    scatter y=Can2 x=Can1 / group=Cluster;
    title2 'Plot of Canonical Variables Identified by Cluster';
Run;

To view the resulting output.

Here is an extra piece of SAS code.  If you need to standardize your variables before putting them into a cluster analysis, this is a sample piece of code that you can use:

/* If you need to standardize your variables – this is how you would do it */
Proc standard data=sashelp.iris out=iris mean=0 std=1;
    var SepalLength SepalWidth PetalLength PetalWidth;
Run;

/* Run a Proc PRINT to see what happened to your data and what changes happened */
Proc print data=iris;
Run;

/* Run a Proc MEANS to check whether the standardization worked or not */
Proc means data=iris;
    var SepalLength SepalWidth PetalLength PetalWidth;
Run;

Discriminant Function Analysis

As noted earlier, this analysis is not an exploratory analysis but an explanatory one.  In fact it is very similar to a Multivariate ANOVA, or MANOVA.  It does, however, have 2 distinct but compatible purposes:

  1. To determine whether the characteristics used to define the groups hold true or not
  2. To classify or predict the group membership of new observations based on the discriminant function.

So what does a discriminant function do?  Essentially it creates a weighted linear combination of the variables used in the analysis, which is then used to differentiate or sort observations into groups.  Logistic regression comes to mind when you define discriminant analysis; however, with logistic regression the predictors can be quantitative or categorical and the fitted curve is sigmoidal in shape.  Discriminant analysis can only use quantitative predictors, and all the assumptions of a general linear model must be met.  So yes, that means residual analysis – normality, homogeneity of variances, and so on.

One of the biggest challenges with discriminant analysis is sample size!  The smallest group in your dataset MUST exceed the number of predictor variables by a “lot”.  Papers have suggested at least 5× or even 10× as many observations as predictors.

So, in the end, discriminant analysis will essentially create a regression-like equation from your data that will “discriminate” observations into the groups defined by a variable in your dataset.  Let’s look at the example to get a better feel for this.

SAS code and output using the IRIS dataset

/* Discriminant Analysis – Fisher’s Iris Data */
Proc discrim data=sashelp.iris anova manova listerr crosslisterr;
    class Species;
    var SepalLength SepalWidth PetalLength PetalWidth;
Run;

To view the resulting output.
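
For the second purpose – classifying new observations – PROC DISCRIM can score a second dataset through its TESTDATA= and TESTOUT= options.  A minimal sketch, where new_flowers is a hypothetical dataset containing the same four measurement variables:

/* Fit the discriminant function on the iris data and classify a new dataset.      */
/* new_flowers is a hypothetical dataset with SepalLength, SepalWidth, PetalLength, */
/* and PetalWidth; the predicted species are written to scored_flowers.             */
Proc discrim data=sashelp.iris testdata=new_flowers testout=scored_flowers;
    class Species;
    var SepalLength SepalWidth PetalLength PetalWidth;
Run;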

Conclusion

A quick review of 3 different types of multivariate analyses using SAS and the same dataset.  Each analysis has a different purpose.  Please ensure that you use the most appropriate analysis for your research question!


Principal Component Analysis in SAS

Many statistical procedures test specific hypotheses.  Principal Component Analysis (PCA), factor analysis, and cluster analysis are examples of analyses that explore the data rather than answer a specific hypothesis.  PCA examines common components among the data by fitting a correlation pattern among the variables.  It is often used to reduce data from several variables down to 2 or 3 components.

Before running a PCA, one of the first things you will need to do is determine whether there is any relationship among the variables you want to include.  If the variables are not related, then there’s no reason to run a PCA.  The data that we will be working with is a sample dataset that contains the 1988 Olympic decathlon results for 33 athletes.  The variables are as follows:

athlete:  sequential ID number
r100m:  time it took to run 100m
longjump:  distance attained in the Long Jump event
shotput:  distance reached with ShotPut
highjump:  height reached in the High Jump event
r400m: time it took to run 400m
h110m:  time it took to run the 110m hurdles
discus:  distance reached with Discus
polevlt:  height reached in the Pole Vault event
javelin:  distance reached with the Javelin
r1500m:  time it took to run 1500m

Let’s start with a PROC CORR to review the relationships among the variables.

Proc corr data=olympic88;
Run;

By reviewing the output found here,  we can see that there are a number of significant relationships suggesting that a PCA will be a valuable method of reducing our data from 10 variables to 2 or 3 components.

There are a few different PROCedures available in SAS to conduct a PCA.  My preferred PROC is PRINCOMP – short for principal components.  Let’s start with the basic syntax:

Proc princomp data=olympic88;
    var r100m longjump shotput highjump r400m h110m discus polevlt javelin r1500m;
Run;

The output starts with the same correlation matrix we created using PROC CORR, although you’ll notice that there are no p-values here – we see the correlations, but we do not know whether they are significantly different from 0.  We also have the Simple Statistics table – means and standard deviations.

Our next table is a table of the eigenvalues for each component.  So let’s step back and talk about what this analysis is really doing for us.

Imagine my cloud of data – throw all the data in the air – there is variation due to the different events and there is variation within each event attributed to the performance of the different athletes.  Our goal with PCA is to reduce our data from the 10 events down to 2 or 3 components that represent these 10 variables (events).  Back to my cloud of data – PCA will draw a line through all the data that explains the most variation possible with that one line – or arrow, if you want to visualize it that way.  That will be the first component.  PCA will then go back to the cloud of data and draw a second line through the data that explains the next-largest amount of variation – the 2nd component – and it will continue to do this until there is no variation left.  If you have 10 variables, you will have 10 components to explain all the variation.  Each component explains a different amount of variation: the first explains the most, the second less, and so on.  PCA provides you with eigenvalues, which translate to an amount of variation.

Now, each component will be made up of bits and pieces of the 10 variables – we will see these as the weightings within each component, or eigenvector.  The most challenging part of PCA is defining the components.  It’s fine to say we have 2 components, but there’s more value in trying to define what the components represent.  For this, you will use the weightings within each eigenvector or component.

Now that we have a better feeling for what is happening when we run this analysis – I said earlier that our goal is to cut down from 10 variables to 2 or 3 components.  How do we do this?  In the output we will see a SCREE PLOT – we want to use this as a guide to help determine how many components to keep.  Where the elbow appears in the scree plot is where we cut off the number of components to use.  Yes, a subjective decision.  We will also use the % variation explained by the components as a guide and aid to support our decision.

If we look at our current output, let’s first scroll down to the scree plot.  Notice it shows the principal component number on the x-axis and the eigenvalue on the y-axis.  Also note that the elbow in the curve happens around the 3rd component.  If you look up at the table that shows the proportion of the variation explained by the components, we see that Component #1 explains 34%, Component #2 explains 26%, and Component #3 explains 9%.  Given that drastic drop between Components 2 and 3, I would select working with only the first 2 components.  A subjective decision!!!  But be able to back it up.  In this case, the first two explain a total of 60% of the variation seen across the 10 events, and as I look down to Component #4, it also explains 9%.  So rather than trying to explain why I kept Component #3 but not Component #4 when they are so similar, I decide to cut it at 2 components.
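
If you would like the eigenvalue table as a dataset (for example, to tabulate the cumulative proportion of variation yourself), ODS OUTPUT can capture it.  A minimal sketch – I am assuming the ODS table name for PROC PRINCOMP is Eigenvalues; run ods trace on; beforehand if you want to confirm the name:

/* Capture the eigenvalue table from PROC PRINCOMP in a dataset and print it */
/* The table name Eigenvalues is assumed – verify with ods trace on; if unsure */
ods output Eigenvalues=pca_eigen;
Proc princomp data=olympic88;
    var r100m longjump shotput highjump r400m h110m discus polevlt javelin r1500m;
Run;

Proc print data=pca_eigen;
Run;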

The next challenge is trying to “define” the 2 components.  To attempt this, look at the table with the weightings for each component.

For Component #1 we see a nice split of events, with all the running events holding a negative weighting and all other events a positive weighting.  This component could be viewed as representing the running ability of the decathletes.

For Component #2 the events with the lowest values are longjump, highjump, and h110m – jumping and hurdling events.  It has been suggested that this component may represent the strength or endurance ability of the decathletes.  Very subjective again!!

This is the basis of PCA.  However, there is one particular output that many associate with PCA that does not accompany the default settings of PRINCOMP: the component plots.  In SAS we need to ask for these plots explicitly.

What I recommend: run the analysis as we did above to determine how many components you want to work with first.  Once you’ve decided, specify this in the following code to obtain the plots of interest.  In older versions of SAS you will need to turn the ODS graphics option on and off to take advantage of the advanced graphing abilities of SAS.  Please note that I have NOT tested this in SAS Studio!!!

ods graphics on;
Proc princomp data=olympic88 n=2 plots(ncomp=2)=all;
    var r100m longjump shotput highjump r400m h110m discus polevlt javelin r1500m;
Run;
ods graphics off;

The above code will result in the following output.  One of the first things you will notice is that by adding the n=2 option to the Proc statement we are telling SAS to calculate only the first 2 components.  The plots(ncomp=2)=all option produces all the plots following the scree plot.

You can use the Component Pattern Profile plot and the component plots to help you define what the components represent.

This is a fun and very straightforward example of how to use PCA with your data.


ARCHIVE: W18 SAS Workshop: Getting Comfortable with your Data

Before we start any statistical analysis, we should really take a step back and get familiar and comfortable with our data – “playing” around with it to ensure that you know what’s in there.  This may sound funny, but getting comfortable with your data by running descriptive statistics really does two things: first, you understand what’s been collected and how; and second, it gives you the opportunity to review the data and find any errors in it.  Sometimes you may find an extra 1 added to the front of a number, or maybe a 6 instead of a 9, or any combination of data entry errors.  By playing around with your data and getting comfortable with it before running your analysis, you may catch some of these anomalies.

For this workshop, I will provide you with a starting SAS program, which you can download here.  You will be asked to type in the PROCs as we work through them, but if you would rather, you always have the option of copying them from this post and pasting them into your SAS editor or code window.  Please note that there may be some nuances when you copy and paste: any curly quotation marks (“ ”) will need to be changed to straight quotes (" ") in your SAS program!!!

My goals for this session are to review the following PROCedures:

  • Proc Contents
  • Proc Univariate
  • Proc Freq
  • Proc Means

PROC CONTENTS

PROC CONTENTS provides you with the backend information on your dataset.  One of the challenges of working with SAS is that you do not have your dataset in front of you all the time.  You read it in and it gets sucked into what I call the “Blackbox of SAS”.  Sometimes we either want to see the data – to ensure it’s still there or simply to be comforted by the sight of it (we use PROC PRINT) – or we want to see the contents of the dataset – the formats of the variables and information about the dataset.

To do this we need to run a Proc CONTENTS on our file.  This is the equivalent of the Variable View in SPSS.

Proc contents data=woodchips;
Run;

What information were you able to see?  Information about the actual SAS datafile along with formatting information about the variables contained in the datafile.  View the output here as a PDF.

If you make changes to the variables along the way, or if you add labels, rerun the Proc CONTENTS to ensure the changes were applied.

PROC UNIVARIATE

Proc UNIVARIATE will be familiar to many of you as the PROC we use to see whether our data is normally distributed or not.  This is one use for this PROCedure, but it is also very handy to get a sense for your data.  It is one PROC that isn’t used to its full capability, in my opinion.

Let’s try running it as follows:

Proc univariate data=woodchips;
var weight;
Run;

Here is a link to the output saved as a PDF file.

As you review the output you can see the variety of descriptive statistics that this PROC provides you.  You should now have a very good feel for the data we are working with.

PROC FREQ

Proc FREQ is used to create frequencies and cross-tabulations.   In our dataset we only have one categorical variable, quality.  To create a frequency table use the following code:

Proc freq data=woodchips;
table quality;
Run;

Here is the link to the output saved as a PDF file.

Should you run a Proc FREQ on a variable such as weight?  Why or why not?

PROC MEANS

Proc MEANS is a fabulous and very versatile PROC for getting a sense of your continuous variables – weight, in our example.  Let’s start with the overall mean by using this code:

Proc means data=woodchips;
var wood_weight;
Run;

Here is the link to the output saved as a PDF file.

Note the default measures – N, Mean, StdDev, Min, Max

To add other descriptive measures, list them at the end of the Proc MEANS statement.  For example, if we want the standard error and the sum:

Proc means data=woodchips mean stderr sum;
var wood_weight;
Run;

Here is the link to the output saved as a PDF file.

One last piece of code for Proc MEANS:  We want to see the means for each quality group.

Proc means data=woodchips;
class quality;
var wood_weight;
Run;

Here is the link to the output saved as a PDF file.
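
If you want those group means saved as a dataset rather than just printed, Proc MEANS also has an OUTPUT statement.  A minimal sketch – the dataset name and the names after the = signs are placeholders I made up:

/* Save the mean and standard error of wood_weight for each quality group */
Proc means data=woodchips noprint;
    class quality;
    var wood_weight;
    output out=quality_stats mean=avg_weight stderr=se_weight;
Run;

Proc print data=quality_stats;
Run;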

For more ways to use Proc MEANS, visit the following blog entry on SASsyFridays: