Please visit the SASsyFridays blog for this session.

PDF copy of the complete SAS syntax used in this workshop
SAS creates graphs – no WAY!!! Yes WAY!! And it does a wonderful job and allows you to customize so many different aspects of a graph. However, like many things in SAS, there is a bit of a learning curve associated with this part of the program AND… it turns out SAS/GRAPH is not available with the University Edition of SAS. You can still create graphs in the University Edition, just not the entire array of them.
To view the full array of graphs that are available in SAS/GRAPH, please visit the SAS Graphics Gallery.
For the purposes of this workshop, I will discuss ODS graphics and we will create a histogram that can be run in both SAS Studio and PC SAS. Time permitting I will showcase a website on the SAS support site to demonstrate the capabilities of SAS/GRAPH – not available on University Edition, but available to those of us running PC SAS or SAS on a server.
ODS, you may recall from previous chats, is the acronym for SAS’ Output Delivery System. It is our gateway to saving our output in a variety of formats: PDF, RTF, Excel, etc. ODS is also the engine behind part of SAS graphics. With many PROCedures, turning on ODS graphics gives you a number of plots and graphics specific to that PROCedure. These are all available in the University Edition – as long as you have access to the PROC.
To turn on the ODS graphics, simply type
ods graphics on;
It may be on by default, but we can ensure that it’s on by running this one line of code. To turn it off at the end of a specific PROCedure, type:
ods graphics off;
PROCedures that support ODS graphics and are available in University Edition SAS Studio:
Let’s try one example with and without ODS graphics to see what we get. You can download a PDF copy of the SAS syntax here or copy the following syntax:
/* Working with a dataset in the SAS Help which contains
blood pressure measurements for males and females.
Let’s read it and save it locally on our own systems */
Data heart;
set sashelp.heart;
Run;
/* Run a Proc CONTENTS to get a sense of what information
can be found in this dataset */
Proc contents data=heart;
Run;
/* Since ODS graphics may be on by default
Let’s turn it off to see what the Proc TTEST
gives us without graphics */
ods graphics off;
/* Let’s run a TTest to see whether there are differences
between males and females for the diastolic measure of BP */
Proc ttest data=heart;
class sex;
var diastolic;
Run;
/* Now let’s turn on the ODS graphics */
ods graphics on;
/* Rerun the Ttest procedure – making no changes
to the code */
Proc ttest data=heart;
class sex;
var diastolic;
Run;
Each PROCedure listed in the table above will produce different plots related to the analysis at hand. For more information on the graphs produced by the PROC, please refer to the PROC documentation. The link in the table above will take you to the ODS graphics page within the PROC.
We will continue to work with the Heart dataset in the SAS Help directory. Now we are looking to create a HISTOGRAM from scratch, rather than using the ODS graphics option. The graph we are looking to create will contain a histogram for diastolic and systolic superimposed for only the Males in this dataset.
We have a large dataset with 5209 observations. For this exercise we would like to create a subset of this dataset that only contains the males. There are a number of ways to do this. I will demonstrate 2 different ways.
Data male_data;
set heart;
if sex = "Male";
Run;
We are creating a new dataset called male_data and reading in the dataset heart – which we created earlier. The IF statement says to keep only the observations where the variable called sex has a value of Male.
When you run this piece of code our new dataset, male_data, now contains only 2336 observations. If you need to see whether this was successful, you can run a Proc print – but restrict the number of observations to see by adding an (obs=xx) at the end of the Proc print statement:
Proc print data=male_data (obs=20);
Run;
OR run a Proc Freq on sex to see whether you have any females in the dataset:
Proc freq data=male_data;
tables sex;
Run;
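For completeness, here is a sketch of a third approach – my own addition, not covered in the workshop – using the WHERE= dataset option on the SET statement:

```sas
/* Subset with the WHERE= dataset option - rows are filtered as they are read */
Data male_data;
set heart (where=(sex = "Male"));
Run;
```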
If you’ve ever programmed in SQL, you’ll know the merits and advantages of using the SQL language. To use SQL you should have a good command of your data structure. In our case, we have a dataset called heart, and we want to create a new dataset called male_data, with all observations that have the value of “Male” in the variable sex. In SAS, we have a PROCedure called SQL that allows you to use SQL coding. Here is the complete code – copy and run it, and we’ll work through each line of the code below.
Proc sql;
create table male_data as
select * from heart
where sex = "Male";
Quit;
Notice that the 3 lines after the Proc statement form a single statement – in other words there is only one ; at the end of the 3 lines. Yes, I could easily have written all three lines as one line of code, but sometimes it is easier to break it into separate lines to see what’s happening.
create table male_data as – creating a new table or dataset in SAS and we’re calling it male_data – this does the same as the Data male_data; in our previous subsetting example.
select * from heart – as you read through the dataset, select all the observations in the dataset heart – this would be similar to our set heart; in the previous example.
where sex = "Male"; – only keep those observations that have a value of Male in the variable called sex. Similar to our if sex = "Male"; in the previous example.
Run the code and double-check again by either running the Proc print code or the Proc freq code to ensure that our male_data dataset only contains males.
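As an aside, you can also verify the result without leaving SQL. This is my own sketch, not part of the workshop syntax – a quick row count by sex:

```sas
/* Count the rows in male_data by sex - should only show Male */
Proc sql;
select sex, count(*) as n
from male_data
group by sex;
Quit;
```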
Both methods provide you with the same resulting dataset. Which way you select is really up to you. Reviewing a blog post from 2010 written by SAS, they have a great analogy that I will link to and repost here.
Using the Data Step is like going grocery shopping and going directly to the aisles where the items you need are located. You know where you need to go.
Using Proc SQL, is like going grocery shopping, but this time you give your list to an employee and you have no control as to how they are acquiring the items on your list. Your grocery list will be completed, but you don’t know how it was completed.
Very interesting analogy! The Data Step is procedural whereas SQL is not. Having said that, you may be asking: why would anyone use Proc SQL? For many people it is comfort! Many SAS programmers learned SQL first and will continue to use it rather than move to the Data Step. Very much like myself – I learned to code in SAS and have a hard time moving to either Enterprise Guide or SAS Studio.
Either way works – pick the one you prefer!
We will be working with our male_data dataset we just created. To create the histogram we will be using the PROC SGPLOT. Here is the complete coding – let’s copy this into our SAS editor, run it, and discuss the coding line by line below.
Proc sgplot data=male_data;
histogram diastolic / transparency= 0.7 binwidth=10;
histogram systolic / transparency= 0.5 binwidth=10;
yaxis grid;
xaxis display=(nolabel);
Run;
The histogram statements tell SAS the type of graph we are looking for in the output window. With Proc SGPLOT you can create histograms, scatter plots, horizontal bars, vertical bars, and time series graphs.
In our example we are creating 2 histograms – one for the diastolic measure and a second for the systolic measure. In both instances we are adding 2 options – one for the transparency of the bars and the second for the width of the BINs. After you run the graph for the first time, go back and change the BINWIDTH to see how the graph changes.
yaxis – adds the y-axis gridlines; a label for the y-axis will be presented by default.
xaxis – since the label of the x-axis would be displayed by default, with our code we are asking that it not be displayed.
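If you would like a smooth curve overlaid on the bars, Proc SGPLOT also offers a DENSITY statement. Here is a sketch building on the histogram code above – the normal and kernel curves are my own additions, not part of the workshop syntax:

```sas
Proc sgplot data=male_data;
histogram diastolic / transparency=0.7 binwidth=10;
density diastolic;               /* overlays a normal density curve */
density diastolic / type=kernel; /* overlays a kernel density estimate */
yaxis grid;
Run;
```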
Using the same male dataset – let’s create a scatter plot with the 2 blood pressure measures and ask SAS to draw a 95% prediction ellipse. Copy and run the following SAS syntax:
proc sgplot data=male_data;
scatter x=diastolic y=systolic;
ellipse x=diastolic y=systolic;
keylegend / location=inside position=bottomright;
run;
Scatter with an x= and y= will create a scatter plot – in our example, with the diastolic measure along the x-axis and the systolic measure along the y-axis.
Ellipse will draw a 95% prediction ellipse around our data as specified by the x- and y-axes.
keylegend – an option that places the legend inside the graph and on the bottom right side of the graph. Try changing the position to see what happens.
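As an extra experiment – my own sketch, not part of the workshop syntax – the GROUP= option on the scatter statement lets you compare males and females on the same plot using the full heart dataset:

```sas
/* Scatter plot of both sexes, with the points coloured by sex */
Proc sgplot data=heart;
scatter x=diastolic y=systolic / group=sex;
keylegend / location=inside position=bottomright;
Run;
```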
Creating graphs in SAS can be a fun challenge. If you are using University Edition SAS Studio, there is a limit to what you can do, since the SAS/GRAPH package is not available to you. But ODS graphics are available, along with some of the more basic graphing features. This post demonstrated the use of ODS graphics and worked through 2 examples using Proc SGPLOT.

PDF Copy of the Correlation and regression notes
We’ve been learning about different aspects of R – how to bring data in, how to clean the data, and how to graph the data. This provides us with the basics of manipulating our data and visualizing it. Remember though that R is a system that is used for statistical computation as well as graphics. What makes it a robust system is that it includes a programming language, graphics, interfaces or connections to other languages, and debugging capabilities. Although we can use R for many different purposes, we can also use it for our statistical needs.
To start the statistical side of things, today we will see how we can use R to create correlations and regressions with our data. We will use Fisher’s Iris dataset as the sample data to take us through correlations and regressions.
Download the R Script file used to work through this section of the R program.
Let’s work through the R script file in R Studio. I will add tips and tricks that I’ve learned to the example at a later date. Please note that the R script file contains comments throughout to guide you through it.
Enjoy!

This week we’ll take a quick tour of the classic T-tests, ANOVAs, and GLMs in SPSS. The dataset we will use will match up with the one that was used in the SAS workshop. Please download and open this Excel file into SPSS. This data is fictional and contains 4 variables:
This trial was designed as a Randomized Complete Block Design (RCBD) and should be analyzed as such. However, to showcase the t-test in SPSS, we will take a step back and play with the data to start.
SPSS makes it easy for us to conduct t-tests on our data. If you go to the menu Analyze -> Compare Means you will see 3 different types of t-tests available to you. You should be comfortable with each one in order to be able to choose the correct one for your analysis.
One sample t-test: This test will compare the mean of your data to 1 value. Examples of this may be – you have collected %protein data on a number of different brands of adult dog food. The recommended %protein in adult dog feed is 25% and you want to check whether your samples are equal to the recommended amount. For this type of test you would use a One-sample t-test.
Independent Samples t-test: This test allows you to compare 2 means from 2 independent groups. Examples: Average age between males and females.
Paired-Samples t-test: This test allows you to compare 2 means taken on the same experimental units. Examples: Average weight before a treatment and average weight after a treatment on the same 20 experimental units.
These 3 tests are used primarily when the outcome variable you are testing is continuous or scale. There are similar t-test equivalents for outcome variables that may not follow that normal distribution. These are:
With our sample data we have a variable called Field. We want to see whether there are any differences between the 2 fields where the data was collected. Please note that we would NOT do this for our trials – we are only doing this for the purposes of this workshop. Ideally we would have a separate dataset that would be more appropriate, but in the interest of efficiency, I have chosen to use the same dataset and create a fictitious variable for demonstration purposes only!!!
To conduct the Independent Samples t-test:
In the output window you should now see 2 tables. The first one displays the mean, standard deviation, and standard error for the Nitrogen variable for each group – so each Field.
The second table provides the t-test results. Note that the first half of this table contains Levene’s test for equality of variances. One of the assumptions of a t-test is that the variation of your outcome variable is equal in both groups. Lucky for us, SPSS provides us with t-test results for the situation where we have equal variances and for when we do not. In our case, Levene’s test tells us that we have equal variation between our groups – p = 0.840 – which means we accept our Null hypothesis that the variation between our groups is equal. We also see that there are indeed differences in Nitrogen between our 2 fields – p < 0.0001. Note that the output says p = 0.000 because it only shows the first 3 decimal places; P is NEVER = 0!!
Now let’s assume that we conducted a Completely Randomized Design (CRD) where we randomly selected our experimental units and placed 4 onto each of the 6 treatments. If this was our experimental design then we would conduct a One-way ANOVA. There are 2 ways to do this in SPSS. Here is the first method:
Your output window should provide you with the matching ANOVA table. In our example, the Between Groups effect is non-significant with a p-value of 0.959. The table also shows us our Within Groups SS, df, and MS.
The second method:
Your output window will now provide you with 2 tables. The first is a Between-Subjects Factor table – showing you where your observations are in relation to the fixed effect of treatment. In our example we can confirm that we have 4 observations (experimental units) on each of the 6 treatments. This is a great way of checking that SPSS has read your data correctly.
Your second table is the ANOVA table – labelled Tests of Between-Subjects Effects. Notice that the 1st ANOVA table you saw above matches this one, but this one provides more information, such as the Intercept – the overall mean. The same conclusions are drawn from this table as from the One-way ANOVA table. I would recommend that you perform any ANOVAs using this method.
So we know that our data was collected by implementing an RCBD, and we have a variable called Block in our dataset that is a RANDOM effect. How do we implement this aspect in SPSS?
The proper statistical model is:
Y_ij = mu + Treatment_i + Block_j + e_ij
where Treatment is a fixed effect, Block is a random effect, and e_ij is the residual error.
To do this in SPSS:
In our output window you should now see 3 tables. The first one – Between Subjects Factor, lists the Treatments and the Blocks. Note that you have 6 observations in each block.
The second table presents our Tests of Between-Subjects Effects, or our ANOVA table. Notice that each factor in our model lists a Hypothesis and an Error line. This is because of our model – the error term has been corrected for the 2 effects in the model. Note that the p-value for Treatment (p = 0.787) is different from our fixed-effects model – the model now incorporates our random Block factor, so it has adjusted, or accounted for, the variation due to Block before looking at the Treatment differences.
In our example, our treatments were not significant, therefore the means among our 6 treatments did not differ – no need to run any PostHoc or means comparison tests. However, you should know how to run these in case your research data shows otherwise. To conduct PostHoc tests – we will do these on our Treatments for demonstration purposes – select the following:
Analyze
You will now have 2 additional tables in your output. The first one shows you each pairwise combination of treatments along with a difference, a standard error for the difference, a p-value, and 95% confidence limits for the difference. The bottom table summarizes this table.
NOTE: if you only have 2 levels in your treatment or fixed effect factor, SPSS will NOT run the PostHoc tests. It’s telling you that if the ANOVA says they’re different – then it doesn’t have to run the extra test because you already know the answer.
This workshop reviewed the use of t-tests, one-way ANOVA, and a GLM in SPSS. As an FYI, there is a lot of talk about GLIMMIX on the SAS side of the house, and SPSS can do similar analyses – I will propose a workshop in the upcoming Summer session that will showcase GLMMs in SPSS.
Remember your research question when conducting any analysis and match the analysis to your research question – always!!

The goal of this workshop and blog post is to review 3 different multivariate analyses. We will use one common dataset to showcase the different purposes of the analyses and to showcase the different PROCedures available in SAS to conduct each analysis.
The dataset we will be using is Fisher’s Iris dataset (1936), originally collected by Dr. E. Anderson and used by Dr. Ronald Fisher to develop discriminant analysis. The dataset contains measures of petal length, petal width, sepal length, and sepal width on 50 plants of each of 3 varieties of Iris. The dataset is available within the SAS Help. To access this dataset you will need to use the dataset name: sashelp.iris.
When we think of statistics, most of us tend to think of our traditional hypothesis-driven analyses: the ANOVAs, regressions, means comparisons, and the list goes on. These are types of Explanatory Analyses. There is another world of statistics, referred to by some as Exploratory Analyses – those analyses that are not driven by a hypothesis. Exploratory analyses are used more for describing relationships among variables or measures that were taken during a trial or in a dataset. Principal Component Analysis (PCA) and Cluster Analysis are two examples of exploratory analyses, whereas discriminant analysis falls into the explanatory analysis bucket.
Please review the PCA blog post for more details regarding this analysis. This post will not provide the same level of detail but will form the basis of using the same dataset across three different analyses.
PCA has its roots in 1901, when it was developed by Karl Pearson. Its primary role is to reduce the number of variables used to explain a dataset. Factor analysis (FA) is a related process and has the same goal. Many people confuse these two and tend to use the terms factor analysis and PCA interchangeably, when the two analyses are similar but not interchangeable. I’ve listed a few of the primary differences between PCA and Factor analysis:
/* For this workshop we will use the IRIS dataset */
/* Fisher’s dataset can be found in the SASHELP */
/* Library. The dataset name is sashelp.iris */
/* SASHELP is the permanent SAS directory */
Proc print data=sashelp.iris;
Run;
/* Let’s get a sense of relationships that may exist */
/* in the dataset. We will use PROC SGPLOT to visualize */
Proc sgplot data=sashelp.iris;
scatter x=SepalLength y=PetalLength; * / datalabel=species;
Run;
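To separate the three varieties visually, here is a sketch of the same plot with the GROUP= option added – my own variation, not part of the workshop syntax:

```sas
/* Same scatter plot, with the points coloured by Iris variety */
Proc sgplot data=sashelp.iris;
scatter x=SepalLength y=PetalLength / group=Species;
Run;
```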
/* We will run the PROC PRINCOMP as we did in the */
/* previous workshop. Options we are using include */
/* plots=all to show all plots available in the PROC */
/* n = 3 – we will start without this option */
/* and then add it back to see only the 3 components */
ods graphics on;
Proc princomp data=sashelp.Iris standard plots=all n=3;
var SepalLength PetalLength SepalWidth PetalWidth;
Run;
ods graphics off;
To view the output. The output explanations will be the same as the explanations reviewed in the last post – only a different dataset.
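One handy extension – my own sketch, not part of the original workshop code: the OUT= option on PROC PRINCOMP saves the component scores (Prin1, Prin2, …) in a new dataset, which can then be plotted with Proc SGPLOT:

```sas
/* Save the first 2 component scores in a dataset called pca_out */
Proc princomp data=sashelp.iris out=pca_out n=2;
var SepalLength SepalWidth PetalLength PetalWidth;
Run;
/* Plot the scores, coloured by variety */
Proc sgplot data=pca_out;
scatter x=Prin1 y=Prin2 / group=Species;
Run;
```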
Cluster analysis is a multivariate analysis that does not have a hypothesis. We are interested in seeing whether there are any natural clusters or groups in the data. Clusters can be based either on the variables or measures collected in the dataset, OR on the observations within the dataset – variables or observations.
Clustering techniques will use two processes: distances and linkages. Being familiar with these terms may help you to select the most appropriate clustering technique for your data.
Distance: quantitative index defining the similarities of the clusters remaining in the analysis at each step.
Linkage: two clusters that have the smallest distance between them as determined by particular distance measures are then linked together to form a new cluster.
Standardizing the variables to be used in a cluster analysis is essential. Because clustering techniques use some measure of “distance”, ensuring that all the variables are on the same scale will result in a better clustering.
There are 2 broad types of clustering techniques used in Cluster Analysis:
In SAS there are 2 PROCedures that are commonly used for Cluster Analysis:
PROC Cluster and PROC Fastclus:
Directly from the SAS Online documentation:
“The CLUSTER procedure is not practical for very large data sets because, with most methods, the CPU time is roughly proportional to the square or cube of the number of observations. The FASTCLUS procedure requires time proportional to the number of observations and can therefore be used with much larger data sets than PROC CLUSTER. If you want to hierarchically cluster a very large data set, you can use PROC FASTCLUS for a preliminary cluster analysis to produce a large number of clusters and then use PROC CLUSTER to hierarchically cluster the preliminary clusters.”
When you create clusters – in any package – it is handy to calculate the means of the clusters and to run them through a Frequency analysis – essentially we want to be able to review some descriptive statistics on our new groups or clusters. PROC Fastclus saves all of this information for us by default or as part of the PROC coding, whereas with PROC Cluster you need to add in a few extra steps. Part of the coding that is commonly used includes a SAS Macro with PROC Cluster to run these descriptive statistics on our output clusters.
However, be assured that the output is the same whether you use PROC Fastclus or PROC Cluster with the macro. For simplicity, we will only use the PROC Fastclus syntax for our example.
/* Cluster Analysis */
/* Creating 2 clusters, saving the results in a new dataset called CLUS */
/* Try a Proc PRINT to see what is found in the new dataset CLUS */
Proc fastclus data=sashelp.iris maxc=2 maxiter=10 out=clus;
var SepalLength SepalWidth PetalLength PetalWidth;
Run;
/* Using the resulting dataset to get a feel for who landed in what cluster */
Proc freq data=clus;
tables cluster*species;
Run;
/* Creating 3 clusters, saving the results in a new dataset called CLUS */
Proc fastclus data=sashelp.iris maxc=3 maxiter=10 out=clus;
var SepalLength SepalWidth PetalLength PetalWidth;
Run;
/* Using the resulting dataset to get a feel for who landed in what cluster */
Proc freq data=clus;
tables cluster*Species;
Run;
/* To obtain a graphical presentation of the clusters we need to run the */
/* Proc CANDISC to get the information needed for the graphical output */
Proc candisc data=clus anova out=can;
class cluster;
var SepalLength SepalWidth PetalLength PetalWidth;
title2 'Canonical Discriminant Analysis of Iris Clusters';
Run;
Proc sgplot data=Can;
scatter y=Can2 x=Can1 / group=Cluster;
title2 'Plot of Canonical Variables Identified by Cluster';
Run;
To view the resulting output.
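The hierarchical route described in the documentation quote above can be sketched as follows – this is my own illustrative code, not part of the workshop syntax, and Ward's method is just one of several linkage choices:

```sas
/* Hierarchical clustering with Ward's method, saving the tree */
Proc cluster data=sashelp.iris method=ward outtree=tree noprint;
var SepalLength SepalWidth PetalLength PetalWidth;
Run;
/* Cut the tree into 3 clusters */
Proc tree data=tree nclusters=3 out=treeclus noprint;
Run;
/* How many observations landed in each cluster? */
Proc freq data=treeclus;
tables cluster;
Run;
```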
Extra piece of SAS code. If you need to standardize your variables before putting them into a Cluster analysis, here is a sample piece of code that you can use:
/* If you need to standardize your variables – this is how you would do it */
Proc standard data=sashelp.iris out=iris mean=0 std=1;
var SepalLength SepalWidth PetalLength PetalWidth;
Run;
/* Run a Proc PRINT to see what happened to your data and what changes happened */
Proc print data=iris;
Run;
/* Run a Proc MEANS to check whether the standardization worked or not */
Proc means data=iris;
var SepalLength SepalWidth PetalLength PetalWidth;
Run;
As noted earlier, this analysis is not an exploratory analysis but an explanatory analysis. In fact it is very similar to a Multivariate ANOVA, or MANOVA. It does, however, have 2 distinct but compatible purposes:
So what does a discriminant function do? Essentially it creates a weighted linear combination of the variables used in the analysis, which is then used to differentiate or group observations into groups. Logistic regression comes to mind when you define discriminant analysis; however, with logistic regression the predictors can be quantitative or categorical, and the fitted curve is sigmoidal in shape. Discriminant analysis can only use quantitative variables, and all the assumptions of a general linear model must be met. So yes, that means residual analysis – normality, homogeneity of variances, and so on.
One of the biggest challenges with discriminant analysis is sample size! The smallest group in your dataset MUST exceed the number of predictor variables by a “lot”. Papers have suggested at least 5× or even 10× the number of predictors.
So, in the end, discriminant analysis will essentially create a regression equation from your data that will “discriminate” observations into the groups defined by a variable in your dataset. Let’s look at the example to get a better feel for this.
/* Discriminant Analysis – Fisher’s Iris Data */
Proc discrim data=sashelp.iris anova manova listerr crosslisterr;
class Species;
var SepalLength SepalWidth PetalLength PetalWidth;
Run;
To view the resulting output.
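If you later collect new measurements and want to classify them with these discriminant functions, PROC DISCRIM offers the TESTDATA= and TESTOUT= options. A sketch – the dataset new_iris here is hypothetical, standing in for any dataset containing the same four measurement variables:

```sas
/* Hypothetical: new_iris holds unclassified plants with the same four
   measurements; TESTOUT= saves their predicted Species */
Proc discrim data=sashelp.iris testdata=new_iris testout=scored;
class Species;
var SepalLength SepalWidth PetalLength PetalWidth;
Run;
Proc print data=scored;
Run;
```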
A quick review of 3 different types of multivariate analyses using SAS and the same dataset. Each analysis has a different purpose. Please ensure that you use the most appropriate analysis for your research question!
