Principal Component Analysis in SAS

Many statistical procedures test specific hypotheses.  Principal Component Analysis (PCA), Factor Analysis, and Cluster Analysis are examples of analyses that explore the data rather than answer a specific hypothesis.  PCA examines common components among the data by fitting a correlation pattern among the variables, and it is often used to reduce data from several variables down to 2 or 3 components.

Before running a PCA, one of the first things you need to do is determine whether there is any relationship among the variables you want to include.  If the variables are not related, then there’s no reason to run a PCA.  The data we will be working with is a sample dataset that contains the 1988 Olympic decathlon results for 33 athletes.  The variables are as follows:

athlete:  sequential ID number
r100m:  time it took to run 100m
longjump:  distance attained in the Long Jump event
shotput:  distance reached in the Shot Put event
highjump:  height reached in the High Jump event
r400m: time it took to run 400m
h110m:  time it took to run the 110m hurdles
discus:  distance reached with Discus
polevlt:  height reached in the Pole Vault event
javelin:  distance reached with the Javelin
r1500m:  time it took to run 1500m

Let’s start with a PROC CORR to review the relationships among the variables.

Proc corr data=olympic88;
Run;

By reviewing the output found here, we can see that there are a number of significant relationships, suggesting that a PCA will be a valuable method of reducing our data from 10 variables to 2 or 3 components.

There are a few different PROCedures available in SAS to conduct a PCA.  My preferred PROC is PRINCOMP – short for principal components.  Let’s start with the basic syntax:

Proc princomp data=olympic88;
var r100m longjump shotput highjump r400m h110m discus polevlt javelin r1500m;
Run;

The output starts with the same correlation matrix we created using PROC CORR, although you’ll notice that there are no p-values here – we see the correlations, but we do not know whether they are significantly different from 0.  We also have the Simple Statistics available – means and standard deviations.

Our next table is a table of the eigenvalues for each component.  So let’s step back and talk about what this analysis is really doing for us.

Imagine my cloud of data – throw all the data in the air.  There is variation due to the different events, and there is variation within each event attributed to the performance of the different athletes.  Our goal with PCA is to reduce our data from the 10 events down to 2 or 3 components that represent those 10 variables (events).  Back to my cloud of data: PCA will draw a line through all the data that explains the most variation possible with that one line – or arrow, if you want to visualize it that way.  That will be the first component.  PCA will then go back to the cloud of data and draw a second line (perpendicular to the first) that explains the next-largest share of the variation – the 2nd component – and it will continue to do this until there is no variation left.  If you have 10 variables, you will have 10 components to explain all the variation.  Each component explains a different amount of variation: the first explains the most, the second less, and so on.  PCA provides you with an eigenvalue for each component, and each eigenvalue translates into an amount of variation – the proportion a component explains is its eigenvalue divided by the total of all the eigenvalues (which, for a PCA on the correlation matrix of 10 variables, is 10).

Now, each component will be made up of bits and pieces of the 10 variables – we will see these as the weightings within each component or eigenvector.  The most challenging part of PCA will be defining the components.  It’s fine to say we have 2 components, but there’s more value in trying to define what the components represent.  For this, you will use the weightings within each eigenvector or component.

Now that we have a better feeling for what is happening when we run this analysis: I said earlier that our goal is to cut down from 10 variables to 2 or 3 components.  How do we do this?  In the output we will see a SCREE PLOT – we use this as a guide to help determine how many components to keep.  Where the elbow appears in the scree plot is where we cut off the number of components.  Yes, a subjective decision.  We will also use the % variation explained by the components as an aid to support our decision.

If we look at our current output, let’s first scroll down to the scree plot.  Notice it shows the principal component number on the x-axis and the eigenvalue on the y-axis.  Also note that the elbow in the curve happens around the 3rd component.  If you look up at the table that shows the proportion of the variation explained by the components, we see that Component #1 explains 34%, Component #2 explains 26%, and Component #3 explains 9%.  Given that drastic drop between Components 2 and 3, I would select working with only the first 2 components.  Subjective decision!!!  But be able to back it up.  In this case, the first two explain a total of 60% of the variation seen across the 10 events, and if I look down to Component #4 it also explains 9%.  So rather than trying to explain why I kept Component #3 but dropped Component #4 when they are so similar, I decide to cut it at 2 components.
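
If you want to keep the eigenvalue table as a dataset (handy for documenting this decision), you can capture it with an ODS OUTPUT statement – the table PRINCOMP produces is named Eigenvalues.  A minimal sketch, where eig is simply my own name for the captured dataset:

ods output Eigenvalues=eig;
Proc princomp data=olympic88;
var r100m longjump shotput highjump r400m h110m discus polevlt javelin r1500m;
Run;

Proc print data=eig;  /* eigenvalue, difference, proportion, and cumulative proportion per component */
Run;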

The next challenge is trying to “define” the 2 components.  To attempt this, look at the table with the weightings for each component.

For Component #1 we see a nice split of events, with all the running events holding a negative weighting and all other events a positive weighting.  This component could be viewed as representing the running ability of the decathletes.

For Component #2, the events with the lowest weightings are longjump, highjump, and h110m – 3 jumping or height events.  It has been suggested that this component may represent the strength or endurance ability of the decathletes.  Very subjective again!!
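
If you want to look at the weightings in a more structured way than eyeballing the table, one option is to capture the Eigenvectors table with ODS OUTPUT and sort the events by the size of their weighting.  A sketch, assuming the default component names Prin1 and Prin2 that PRINCOMP uses; evec and evec_abs are just my own dataset names:

ods output Eigenvectors=evec;
Proc princomp data=olympic88 n=2;
var r100m longjump shotput highjump r400m h110m discus polevlt javelin r1500m;
Run;

data evec_abs;
set evec;
abs1 = abs(Prin1);  /* size of the weighting on component 1, ignoring sign */
abs2 = abs(Prin2);  /* size of the weighting on component 2, ignoring sign */
run;

proc sort data=evec_abs;
by descending abs1;  /* events carrying the most weight in component 1 come first */
run;

Proc print data=evec_abs;
var Variable Prin1 Prin2;
Run;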

This is the basis of PCA.  However, there is one piece of output that many associate with PCA that is not produced by the default settings of PRINCOMP: the component plots.  In SAS we need to ask for these plots explicitly.

What I recommend is to run the analysis as we did above to determine how many components you want to work with first.  Once you’ve decided, specify this in the following code to obtain the plots of interest.  In older versions of SAS you will need to turn the ODS graphics option on and off to take advantage of the advanced graphing abilities of SAS.  Please note that I have NOT tested this in SAS Studio!!!

ods graphics on;
Proc princomp data=olympic88 n=2 plots(ncomp=3)=all;
var r100m longjump shotput highjump r400m h110m discus polevlt javelin r1500m;
Run;
ods graphics off;

The above code will result in the following output.  One of the first things you will notice is that by adding the n=2 option on the Proc statement we are telling SAS to calculate only the first 2 components.  The plots(ncomp=3)=all option requests all of the available plots; the ncomp= part controls how many components are included in the pattern and score plots that follow the Scree Plot (since only 2 components are computed here, only 2 can actually be plotted).

You can use the Component Pattern Profile plot and the component plots to help you define what the components represent.
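
One more option worth mentioning: if you want to plot the athletes themselves on the new component axes, or use the component scores in a follow-up analysis, PRINCOMP can write the scores to a dataset with the out= option; by default the scores are named Prin1, Prin2, and so on.  A quick sketch (the SGPLOT step is just one way to draw your own score plot – plots=all already gives you one), where scores is simply my own name for the output dataset:

Proc princomp data=olympic88 n=2 out=scores;
var r100m longjump shotput highjump r400m h110m discus polevlt javelin r1500m;
Run;

Proc sgplot data=scores;
scatter x=Prin1 y=Prin2 / datalabel=athlete;  /* label each point with the athlete ID */
Run;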

This is a fun and very straightforward example of how to use PCA with your data.


ARCHIVE: W18 SAS Workshop: Getting Comfortable with your Data

Before we start any statistical analysis, we should take a step back and get familiar and comfortable with our data – “playing” around with it to ensure that you know what’s in there.  This may sound funny, but getting comfortable with your data by running descriptive statistics really does two things: first, you understand what’s been collected and how; and second, it gives you the opportunity to review the data and find any errors in it.  Sometimes you may find an extra 1 added to the front of a number, or maybe a 6 instead of a 9, or any combination of data entry errors.  By playing around with your data and getting comfortable with it before running your analysis, you may catch some of these anomalies.

For this workshop, I will provide you with a starting SAS program, which you can download here.  You will be asked to type in the PROCs as we work through them, but if you would rather, you always have the option of copying them from this post and pasting them into your SAS editor or code window.  Please note that there may be some nuances when you copy and paste: any curly quotation marks (”) will need to be changed to straight quotes in your SAS program!!!

My goals for this session are to review the following PROCedures:

  • Proc Contents
  • Proc Univariate
  • Proc Freq
  • Proc Means

PROC CONTENTS

PROC CONTENTS provides you with the backend information on your dataset.  One of the challenges in working with SAS is that you do not have your dataset in front of you all the time.  You read it in and it gets sucked into what I call the “Blackbox of SAS”.  Sometimes we either want to see the data – to ensure it’s still there, or simply to be comforted by the sight of it (we use PROC PRINT) – or we want to see the contents of the dataset – the formats of the variables and information about the dataset.
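
As an aside, if all you want is to reassure yourself that the data made it into the Blackbox, a quick Proc PRINT does the trick.  A small sketch – the obs= dataset option simply limits how many rows are printed:

Proc print data=woodchips (obs=10);  /* show only the first 10 observations */
Run;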

To see the contents, we run a Proc CONTENTS on our file.  This is the equivalent of the Variable View in SPSS.

Proc contents data=woodchips;
Run;

What information were you able to see?  Information about the actual SAS datafile along with formatting information about the variables contained in the datafile.  View the output here as a PDF.

If you make changes to the variables along the way, or if you add labels, rerun the Proc CONTENTS to ensure the changes were applied.
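
For example, here is a small sketch of adding a label in a DATA step and then rerunning Proc CONTENTS to confirm it was applied – the label text is just a made-up example, and I’m using the weight variable from the Proc UNIVARIATE example below:

data woodchips;
set woodchips;
label weight = 'Weight of the wood chip';  /* example label only */
run;

Proc contents data=woodchips;
Run;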

PROC UNIVARIATE

Proc UNIVARIATE will be familiar to many of you as the PROC we use to see whether our data is normally distributed or not.  This is one use for this PROCedure, but it is also very handy to get a sense for your data.  It is one PROC that isn’t used to its full capability, in my opinion.

Let’s try running it as follows:

Proc univariate data=woodchips;
var weight;
Run;

Here is a link to the output saved as a PDF file.

As you review the output you can see the variety of descriptive statistics that this PROC provides you.  You should now have a very good feel for the data we are working with.
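
Earlier I mentioned that many of us reach for Proc UNIVARIATE to check normality.  If that is what you’re after, a sketch like the following adds the formal normality tests plus a histogram and Q-Q plot with a fitted normal curve:

Proc univariate data=woodchips normal;
var weight;
histogram weight / normal;  /* histogram with a fitted normal curve overlaid */
qqplot weight / normal(mu=est sigma=est);  /* Q-Q plot against a fitted normal */
Run;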

PROC FREQ

Proc FREQ is used to create frequencies and cross-tabulations.   In our dataset we only have one categorical variable, quality.  To create a frequency table use the following code:

Proc freq data=woodchips;
table quality;
Run;

Here is the link to the output saved as a PDF file.

Should you run a Proc FREQ on a variable such as weight?  Why or why not?
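
Our dataset only has the one categorical variable, but for reference, a cross-tabulation simply joins two categorical variables with an asterisk on the table statement.  A sketch, where supplier is a made-up second categorical variable just for illustration:

Proc freq data=woodchips;
table quality*supplier / chisq;  /* cross-tabulation with a chi-square test of association */
Run;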

PROC MEANS

Proc MEANS is a fabulous and very versatile Proc for getting a sense of your continuous variables – weight, in our example.  Let’s start with the overall mean by using this code:

Proc means data=woodchips;
var wood_weight;
Run;

Here is the link to the output saved as a PDF file.

Note the default measures – N, Mean, StdDev, Min, Max

To request other descriptive measures, list them at the end of the Proc MEANS statement (note that listing statistics this way replaces the defaults).  For example, say we want the mean, the standard error, and the Sum:

Proc means data=woodchips mean stderr sum;
var wood_weight;
Run;

Here is the link to the output saved as a PDF file.

One last piece of code for Proc MEANS:  We want to see the means for each quality group.

Proc means data=woodchips;
class quality;
var wood_weight;
Run;

Here is the link to the output saved as a PDF file.
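
One extra trick worth knowing: Proc MEANS can also save those group means to a new dataset with an OUTPUT statement, which is handy if you want to use them in a later step.  A sketch – quality_means and avg_weight are just my own names:

Proc means data=woodchips noprint;
class quality;
var wood_weight;
output out=quality_means mean=avg_weight n=n_chips;  /* one row per quality group */
Run;

Proc print data=quality_means;
Run;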

For more ways to use Proc MEANS, visit the following blog entry on SASsyFridays:

ARCHIVE: W18 RDM Workshop: Review

This workshop is the fourth in a series of 4 offered in partnership with Carol Perry, Associate Librarian Research and Scholarship.  These workshops are hands-on and have exercises associated with each aspect being covered in the workshop.

This final workshop reviews all the information we discussed in the series and brings it back to the Data Management Plan (DMP).  The PowerPoint presentation is available here; please review it for more information and contact either Carol Perry or Michelle Edwards with questions.

Crimes of Statistics: Is it RANDOM or is it FIXED?

This is a topic that comes up a lot these days during my consultation appointments.  Deciding whether our treatments are FIXED or RANDOM is easy, but when we combine experiments – something that is commonly done in the field of Plant Agriculture – are years, trials, and environments FIXED or RANDOM?

I’d like to propose that we talk about this one question during our session this coming week, and that you read the following paper in preparation for our discussion.  Moore and Dixon do a great job of digging into this topic, but there is still room for discussion, especially as it relates to your own trials.  See you on Wednesday, February 14, 2018.

Moore, K.J. and Dixon, P.M. (2015). Analysis of Combined Experiments Revisited.  Agronomy Journal 107(2): 763-771. doi:10.2134/agronj13.0485
