ARCHIVE: W18 SPSS Workshop: Getting Comfortable with your Data

Before we start any statistical analysis, we should take a step back and get familiar and comfortable with our data by “playing” around with it to make sure we know what’s in there.  This may sound funny, but getting comfortable with your data by running descriptive statistics really does two things:  first, you understand what’s been collected and how; and second, it gives you the opportunity to review the data and find any errors in it.  Sometimes you may find an extra 1 added to the front of a number, or maybe a 6 instead of a 9, or any combination of data entry errors.  By playing around with your data and getting comfortable with it before running your analysis, you may find some of these anomalies.

For this workshop, we will use a fictitious dataset looking at 25 samples of woodchips, their weight and a quality score for the woodchips within each sample.  Please download the dataset here.  Once you have downloaded the Excel file, open it in SPSS.

My goals for this session are to review the use of the Descriptive Statistics in SPSS and some file information.

DATA FILE INFORMATION

When you receive a file from a colleague, labmate, website, or repository, it is often very handy to take a look at the Data File Information, to give you a sense of what is contained in the file.  To accomplish this, follow these steps:

  • File
    • Display Data File Information
    • Working File – which is the file that is currently open in SPSS

The data file information will now be available in the SPSS Statistics Viewer.  Notice that the information is very similar to what we see in the Variable View, with the exception of the last 2 columns:  Print Format and Write Format.  These two columns show us the internal formatting of the variables.  Note that they are, and should be, the same for each variable.  The PRINT format is the format used to display the variable in output, and the WRITE format is the format used when the variable is written out to a file.  To change either format you will need to use the FORMATS command.  For more information on this please visit this page on the IBM Knowledge Center.

If there are any values set up in the dataset, the data file information will provide you with a small table with the values and their respective labels.  To test this out add the following labels to the Quality variable:

1 = Low Quality
2 = Regular Quality
3 = High Quality
4 = Exceptional Quality

Once you’ve added these to your dataset, save it on your computer, and try running the Data File Information again to see how the output changes.

Descriptive Statistics

Descriptive statistics are essentially that – they describe your data, or they summarize your data to give you a good, solid base understanding of what you have collected.  The type of descriptive statistics you will conduct will depend on the type of variable you have.  Remember the 3 types of variables that SPSS distinguishes between?

  • Scale – a continuous piece of information, also referred to as Interval or Ratio.  Examples: age, weight, height
  • Nominal – a categorical piece of data – there is NO relationship between the categories.  Examples:  religion, colour, gender
  • Ordinal – a categorical piece of data – this time there is a relationship or order to the categories.  Examples:  Year of study, age group, Likert scales

Each of these data types will use a different type of descriptive statistic.  For instance, calculating the mean of colour makes no sense at all, but a frequency count of colour does work.

Frequency

To calculate the frequency of a categorical variable (nominal OR ordinal) in SPSS:

  • Analyze
  • Descriptive Statistics
  • Frequencies
    • Select the variables in question and drag to the right hand side
      • As an example, select Quality
    • Click OK to run

You should now have a frequency table of the variable, Quality.

The table lists the categories of the variable.  If you had not provided the value labels, you would see 1; 2; 3; 4 as the categories with no explanation as to what they represent.

The table lists Frequency, the actual count of observations in each category; Percent, the percent of all observations, including any missing ones, that fall in the category; Valid Percent, the percentage based only on the observations that have a value for Quality, which differs from Percent only if you have missing observations; and Cumulative Percent, the running total of the Valid Percent down the table.
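To see how the four columns relate to each other, here is a minimal sketch in Python (outside SPSS) that rebuilds a frequency table by hand.  The Quality counts are the ones reported in the Mode section of this workshop; the single missing value is a hypothetical addition, included only so that Percent and Valid Percent differ.

```python
from collections import Counter

# Value labels used for the Quality variable in this workshop.
labels = {1: "Low Quality", 2: "Regular Quality",
          3: "High Quality", 4: "Exceptional Quality"}

# Quality codes rebuilt from the reported counts (5, 6, 8, 6).
# The trailing None is a HYPOTHETICAL missing value, for illustration only.
quality = [1] * 5 + [2] * 6 + [3] * 8 + [4] * 6 + [None]


def frequency_table(values):
    """Return rows of (code, count, percent, valid percent, cumulative percent)."""
    total = len(values)                       # all observations, missing included
    valid = [v for v in values if v is not None]
    counts = Counter(valid)
    rows = []
    cumulative = 0.0
    for code in sorted(counts):
        percent = 100 * counts[code] / total
        valid_percent = 100 * counts[code] / len(valid)
        cumulative += valid_percent           # running total of Valid Percent
        rows.append((code, counts[code], percent, valid_percent, cumulative))
    return rows


table = frequency_table(quality)
for code, count, pct, valid_pct, cum_pct in table:
    print(f"{labels[code]:<20} {count:>3} {pct:6.1f} {valid_pct:6.1f} {cum_pct:6.1f}")
```

With the missing value removed from the list, the Percent and Valid Percent columns become identical.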

Mode

Mode is the value in the data that appears the most.  When you run the frequency you have a table that shows you the 4 levels of wood quality:

  • Low Quality = 5
  • Regular Quality = 6
  • High Quality = 8
  • Exceptional Quality = 6

By looking at these results I can see that High Quality appears to be the category that was selected the most.  But let’s get SPSS to do the hard work for us and confirm whether this is correct or not.

To obtain the MODE of a variable:

  • Analyze
  • Descriptive Statistics
  • Frequencies
    • Select the variables in question and drag to the right hand side
    • Click on the Statistics button on the right
      • Select Mode
      • Click Continue
      • Click OK

You should now see the Mode in the first table of the Frequency output.

Median

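The median of a variable is the middle value once the observations are sorted from lowest to highest.  If you have an odd number of observations, the median is the single middle value; if you have an even number, there are two middle values and the median is taken as their average.  For an ordinal variable such as Quality, the median is the category that contains the middle observation.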

To obtain the MEDIAN in SPSS, follow the same instructions as the MODE, but select the MEDIAN in the Statistics dialogue box.
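As a quick check on what SPSS reports, the mode and median of Quality can be worked out by hand.  The sketch below uses Python's standard `statistics` module (not SPSS) and rebuilds the 25 Quality codes from the counts listed in the Mode section; the short even-length list is a made-up example.

```python
import statistics

# 25 Quality codes rebuilt from the reported counts:
# 5 Low (1), 6 Regular (2), 8 High (3), 6 Exceptional (4).
quality = [1] * 5 + [2] * 6 + [3] * 8 + [4] * 6

# Mode: the code that occurs most often.
mode_code = statistics.mode(quality)

# Median: with 25 sorted observations, the middle one is the 13th.
median_code = statistics.median(quality)

# HYPOTHETICAL even-length example: the two middle values are averaged.
even_median = statistics.median([1, 2, 3, 4])

print(mode_code, median_code, even_median)
```

Both the mode and the median land on code 3, High Quality, because the 13th sorted observation falls inside that category.  For ordinal codes an averaged median such as 2.5 is not a real category, so `statistics.median_low` may be a more sensible choice in that situation.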

Mean

The mean or average is calculated on a scale variable or continuous variable.  It just doesn’t make sense to calculate the mean of a categorical variable.

To obtain the MEAN in SPSS:

  • Analyze
  • Descriptive Statistics
  • Descriptives
    • Select the variable in question and drag to the right hand side
      • Click OK to run

You should now have a table with N, Minimum, Maximum, Mean, and Standard Deviation for the variable you selected, for example the wood chip weight.  These are the default values you obtain when you run this analysis.  But, what happens if you want the Sum or the Standard Error of this variable?

  • Analyze
  • Descriptive Statistics
  • Descriptives
    • Select the variable in question and drag to the right hand side
    • Select the Options button – this will open another dialogue box that has a list of statistics to select from
      • Select Sum and S.E. mean (standard error of the mean)
    • Click Continue
    • Click OK to run

Your output table will now contain these added statistics.
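To make the link between these statistics concrete, here is a minimal Python sketch (not SPSS) of the same calculations.  The weights are made-up values, not the workshop dataset.  The standard deviation uses the sample (n - 1) formula, and the standard error of the mean is the standard deviation divided by the square root of N.

```python
import math
import statistics

# HYPOTHETICAL wood chip weights, for illustration only.
weights = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

n = len(weights)
minimum = min(weights)
maximum = max(weights)
total = sum(weights)                 # Sum
mean = statistics.mean(weights)      # Mean
sd = statistics.stdev(weights)       # sample standard deviation (n - 1)
se_mean = sd / math.sqrt(n)          # S.E. mean

print(f"N={n} Min={minimum} Max={maximum} Sum={total}")
print(f"Mean={mean:.3f} SD={sd:.3f} SE={se_mean:.3f}")
```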

Explore Function in SPSS

Sometimes you may want to determine the mean wood chip weight by quality category or by another categorical variable.  Until now, we’ve been looking at the entire dataset.  There are a few ways to do this, but the most direct way is to use the Explore function in SPSS.

  • Analyze
  • Descriptive Statistics
  • Explore
    • In the Dependent List box, add the variables for which you would like to calculate the means
    • In the Factor List box, add the variable by which you would like to break down the means – for example: Quality
    • Click OK to run.

You will now see a much larger table than we have seen to date.  SPSS provides you with a long list of descriptive statistics for wood chip weight by each quality category.

You will also see a Stem and Leaf plot along with a Boxplot to provide you with a sense of the distribution of the data.  This gives you more information to help you get a better feeling for the data that you are working with.
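The core of what Explore does with a Factor List is to split the data by category and summarize each group separately.  A minimal Python sketch of that idea follows.  The (quality code, weight) pairs are made-up values, and only the group size and mean are shown, not the full list of statistics SPSS produces.

```python
from collections import defaultdict
from statistics import mean

# HYPOTHETICAL (quality code, weight) pairs, for illustration only.
samples = [(1, 10.0), (1, 12.0),
           (2, 14.0), (2, 16.0),
           (3, 20.0), (3, 22.0), (3, 24.0)]

# Split the weights by quality category.
groups = defaultdict(list)
for quality_code, weight in samples:
    groups[quality_code].append(weight)

# Summarize each category separately.
group_means = {code: mean(values) for code, values in sorted(groups.items())}

for code, group_mean in group_means.items():
    print(f"Quality {code}: n={len(groups[code])} mean={group_mean:.1f}")
```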

Summary

The common descriptive statistics that are used include: frequency, median, mode, mean, and measures of variation (standard deviation, standard error, etc.).  Each of these statistics should be run on the appropriate types of data.  Keep in mind that a frequency on a variable such as age will give you a long table with meaningless information.

SPSS OUTPUT WINDOW

As we’ve been working along, you’ve already noticed that all the output or results can be found in a second window – referred to as the SPSS Statistics Viewer window.  If you want to save your work here, using the File -> Save or Save As option will save the entire output window as an .SPV file which is an SPSS format.  This means that if you want to re-open this file you must have SPSS installed on your computer.

If you only want to save a table or a chart, you have a couple of options:

  1. Export the parts you want to save as a Word, Excel, or PDF file, among a few other options.  To accomplish this, follow these steps:
    • select the tables, graphs that you want to export
    • File
      • Export… you should see a new dialogue box open.
        • At the top, ensure that you select “Selected”.  If you leave it as the default ALL, you will be exporting everything in the SPSS output window including the Notes for each analysis.
        • Select the Type of Document you wish to export to – PDF, Excel, etc…
        • Select the location and name for the file you will be exporting in the File Name box
        • Click OK to run
        • This will result in a new file in the location you set out – with the SPSS results you selected.
  2. Copy and Paste
    • This is probably the easiest way to save the tables or charts you want.  On a WINDOWS computer, simply select the table or chart, Copy (either by using the Menubar option or Ctrl-C), move to the document you want to paste the results into – Word, PPT, Excel, etc. – and Paste (either by using the Menubar option or Ctrl-V).
    • On a MAC, you will need to use the Menubar option and select Copy Special and check Image.  Move to the document where you want the selected table or graph and Paste (either by using the Menubar option or Cmd-V).


R: Data Wrangling

February 23, 2018 – session was taught by Andrew Frewin.  He took us through examples of different processes we may typically use to “wrangle” our research data.  Included in this post are two documents:

Please review the R Script file for comments on how each library and function was used in this session.

The next R-Users session will be on March 2.  We will be learning all about ggplot2.

Principal Component Analysis in SAS

Many statistical procedures test specific hypotheses.  Principal Component Analysis (PCA), Factor Analysis, and Cluster Analysis are examples of analyses that explore the data rather than test a specific hypothesis.  PCA examines common components among data by fitting a correlation pattern among the variables.  It is often used to reduce data from several variables to 2-3 components.

Before running a PCA, one of the first things you will need to do is to determine whether there is any relationship among the variables you want to include in a PCA.  If the variables are not related then there’s no reason to run a PCA.  The data that we will be working with is a sample dataset that contains the 1988 Olympic decathlon results for 33 athletes.  The variables are as follows:

athlete:  sequential ID number
r100m:  time it took to run 100m
longjump:  distance attained in the Long Jump event
shotput:  distance reached with ShotPut
highjump:  height reached in the High Jump event
r400m: time it took to run 400m
h110m:  time it took to run 110m of hurdles
discus:  distance reached with Discus
polevlt:  height reached in the Pole Vault event
javelin:  distance reached with the Javelin
r1500m:  time it took to run 1500m

Let’s start with a PROC CORR to review the relationships among the variables.

Proc corr data=olympic88;
Run;

By reviewing the output found here,  we can see that there are a number of significant relationships suggesting that a PCA will be a valuable method of reducing our data from 10 variables to 2 or 3 components.
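For readers who want to see what sits behind each cell of the correlation table, here is a minimal Python sketch of the Pearson correlation coefficient, which is what PROC CORR reports by default.  The two short lists are made-up values, not the decathlon data, and the function name `pearson_r` is purely illustrative.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squared deviations about the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# HYPOTHETICAL results for two events, for illustration only.
event_a = [1.0, 2.0, 3.0, 4.0, 5.0]
event_b = [2.0, 4.0, 5.0, 4.0, 5.0]

r = pearson_r(event_a, event_b)
print(round(r, 3))
```

A value near 0 for every pair of variables would suggest there is little shared variation for a PCA to summarize.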

There are a few different PROCedures available in SAS to conduct a PCA.  My preferred PROC is PRINCOMP – short for principal components.  Let’s start with the basic syntax:

Proc princomp data=olympic88;
var r100m longjump shotput highjump r400m h110m discus polevlt javelin r1500m;
Run;

The output starts with the same correlation matrix we created using PROC CORR, although you’ll notice that there are no p-values available here.  We see the correlations, but we do not know whether they are significantly different from 0 or not.  We also have the Simple Statistics available – Means and Standard Deviations.

Our next table is a table of the eigenvalues for each component.  So let’s step back and talk about what this analysis is really doing for us.

Imagine my cloud of data – throw all the data in the air – there is variation due to the different events and there is variation within each event attributed to the performance of the different athletes.  Our goal with PCA is to reduce our data from the 10 events down to 2 or 3 components that represent these 10 variables (events).  Back to my cloud of data – PCA will draw a line through all the data that explains the most variation possible with that one line – or arrow if you want to visualize this.  That will be the first component.  PCA will then go back to the cloud of data and draw a second line through the data that explains the next “most” variation – the 2nd component – and it will continue to do this until there is no variation left.  If you have 10 variables, you will have 10 components to explain all the variation.  Each component explains a different amount of variation.  The first will explain the most, the second less, and so on.  PCA will provide you with eigenvalues, which translate to an amount of variation.
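The smallest possible case makes the idea easier to see.  With only two standardized variables whose correlation is r, the correlation matrix is [[1, r], [r, 1]], and its eigenvalues are 1 + r and 1 - r.  The Python sketch below works through that case.  The value r = 0.6 is made up for illustration and has nothing to do with the decathlon data.

```python
import math

# HYPOTHETICAL correlation between two standardized variables.
r = 0.6
corr = [[1.0, r],
        [r, 1.0]]

# For a 2x2 correlation matrix the eigenvalues are 1 + r and 1 - r.
eigenvalues = sorted([1 + r, 1 - r], reverse=True)

# The eigenvalues add up to the number of variables (here 2),
# so each one divided by that total is a proportion of the variation.
total_variance = sum(eigenvalues)
proportions = [value / total_variance for value in eigenvalues]

# First eigenvector: equal weightings on both variables (the "line" through the cloud).
first_component = [1 / math.sqrt(2), 1 / math.sqrt(2)]

# Check: multiplying the matrix by the eigenvector rescales it by the eigenvalue.
product = [sum(corr[i][j] * first_component[j] for j in range(2)) for i in range(2)]

print(eigenvalues, proportions)
```

Here the first component carries 80% of the variation and the second the remaining 20%.  With 10 variables the same logic applies, just with 10 eigenvalues.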

Now, each component will be made up of bits and pieces of the 10 variables – we will see these as the weightings within each Component or Eigenvector.  The most challenging part of PCA, will be the definition of the components.  It’s fine to say we have 2 components, but there’s more value in trying to define what the components represent.  For this, you will use the weightings within each eigenvector or component.

Now that we have a better feeling for what is happening when we run this analysis, recall that our goal is to cut down from 10 variables to 2 or 3 components.  How do we do this?  During the analysis, we will see a SCREE PLOT – we want to use this as a guide to help determine how many components we will use.  Where the elbow appears in the scree plot is where we cut off the number of components to use.  Yes, a subjective decision.  We will also use the % variation explained by the components as a guide and aid to support our decision.

If we look at our current output – let’s first scroll down to the scree plot.  Notice it shows the principal component number on the x-axis and the eigenvalue on the y-axis.  Also note that the elbow in the curve happens around the 3rd component.  If you look up at the table that shows you the proportion of the variation explained by the components, we see that Component #1 explains 34%, Component #2 explains 26%, and Component #3 explains 9%.  Given that drastic drop between Component 2 and 3, I would select working with only the first 2 components.  Subjective decision!!!  But be able to back it up.  In this case, the first two explain a total of 60% of the variation seen across the 10 events, and as I look down to component #4 it also explains 9%.  So rather than trying to explain why I didn’t include component #4 while keeping component #3 when they are so similar, I decide to cut it at 2 components.
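The arithmetic behind this decision can be sketched in a few lines of Python.  The proportions are the rounded figures quoted above for components 1 to 4.  The approximate eigenvalues assume the analysis was run on the correlation matrix (the PRINCOMP default), in which case the eigenvalues add up to the number of variables, here 10.

```python
from itertools import accumulate

# Proportion of variation explained by components 1-4, as quoted in the text (rounded).
proportions = [0.34, 0.26, 0.09, 0.09]
n_variables = 10

# Running total: how much variation the first k components explain together.
cumulative = list(accumulate(proportions))

# ASSUMPTION: correlation-matrix PCA, so eigenvalue = proportion x number of variables.
approx_eigenvalues = [p * n_variables for p in proportions]

for number, (p, c, e) in enumerate(zip(proportions, cumulative, approx_eigenvalues), start=1):
    print(f"Component {number}: eigenvalue ~{e:.1f}  proportion {p:.0%}  cumulative {c:.0%}")
```

The jump from 26% to 9% between components 2 and 3 is the drop discussed above, and the cumulative figure for two components is the 60% used to justify the cutoff.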

The next challenge is trying to “define” the 2 components.  To attempt this, look at the table with the weightings for each component.

For component #1 we see a nice split of events with all the running events holding a -ve weighting and all other events a +ve weighting.  This component could be viewed as representing the running ability of the decathletes.

For component #2 the events with the lowest values are longjump, highjump, and h110m – 3 jumping or height events.  It has been suggested that this component may be represented as the strength or endurance ability of the decathletes.  Very subjective again!!

This is the basis of PCA – however, in our output there is one particular output that many associate with PCA that does not accompany the default settings for PRINCOMP.  We want to see the component plots.  In SAS we need to specify these plots.

What I recommend is to run the analysis as we have above to determine how many components you want to work with first.  Once you’ve decided, you will specify this in the following code to obtain the plots of interest.  In older versions of SAS you will need to turn the ODS graphics option on and off to take advantage of the advanced graphing abilities of SAS.  Please note that I have NOT tested this in SAS Studio!!!

ods graphics on;
Proc princomp data=olympic88 n=2 plots(ncomp=3)=all;
var r100m longjump shotput highjump r400m h110m discus polevlt javelin r1500m;
Run;
ods graphics off;

The above code will result in the following output.  One of the first things you will notice is that by adding the n=2 option in the Proc statement we are telling SAS to only calculate the first 2 components.  The plots(ncomp=3)=all produces all the plots following the Scree Plot.

You can use the Component Pattern Profile plot and the component plots to help you define what the components represent.

This is a fun and very straightforward example of how to use PCA with your data.
