SAS – Page 6 – Agricultural Statistics

SAS Syntax: Best Practices

There are many different ways to write your SAS code, or syntax. After working in SAS for a few decades, I would like to offer a series of Best Practices for writing your own syntax. Remember that how you write your syntax or code is up to you, as long as it works – right? 🙂

At the end of any DATA step or PROC – add a Run; statement. This will allow you to select pieces of your code to run separately from your whole program.
Always specify the name of the dataset you are using when running a PROC. This avoids any confusion in the results when you run parts of your program. Without DATA=, SAS runs the analysis on the most recently created dataset, which may not be the one you want it to use.
When using PROC GLIMMIX, take advantage of SAS comment statements, which start with * and end at the next semicolon. Include all of your models in the PROC, but add a * in front of each MODEL statement you are not using at the moment. For example:
Proc glimmix data=flowers plots=studentpanel;
  class block trmt;
  model count_flower = trmt;
  * model prop_flower = trmt;
  random block;
Run;
Try to add more comments in your program. This way you will remember what you did in a couple of months.
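The practices above can be sketched in a few lines. This is only an illustration: it uses SASHELP.CLASS, a sample dataset that ships with SAS, as a stand-in for your own data.

```sas
/* Block comment: can span several lines, handy for a program header.
   Purpose: quick look at the SASHELP.CLASS sample data (a stand-in dataset) */

* Statement comment: starts with * and ends at the next semicolon;
Proc print data=sashelp.class;   /* always name the dataset with DATA= */
Run;

* Summary of height, with the weight analysis switched off for now;
Proc means data=sashelp.class;
  var height;
  * var weight;
Run;
```

Because each step ends with its own Run; statement, you can highlight either PROC and submit it on its own.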

If I think of any more items, I will come back and add them at a later date. If you have any suggestions, please drop me a line and I’ll add them in.


SAS Workshop: GLIMMIX and non-Gaussian Distributions

One of the biggest advantages of using PROC GLIMMIX is that the data coming into the analysis no longer need to be normally distributed. The data can follow a number of other distributions and SAS can handle it. Our job now is to recognize when a normal distribution is NOT appropriate and which distribution is an appropriate starting place. These other distributions are referred to as non-Gaussian distributions. Remember that Gaussian is just another name for Normal.

Where do we start? Think about your data – what is it?

A percentage?
A count?
A score?

How do we know that our data is not from a normal distribution?

Always check your residuals!
Remember the assumptions of your analyses?
- Normally distributed residuals is one of them!

Let’s work with the following example. We have another RCBD trial with 4 blocks and 4 treatments randomly assigned to each block. There were 2 outcome measures taken: proportion of the plot that flowered, and the number of plants in each plot at the end of the trial.

Please copy and paste the attached code to create the SAS dataset on your computer.

We will work through the output and how/when you need to add the DIST= option to your MODEL statement. We will also talk about the LINK= option and what it does.
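As a rough sketch of where DIST= and LINK= go, here are two possible starting models. They assume the workshop dataset is called flowers and contains block, trmt, count_flower, and prop_flower, as in the Best Practices example. The distributions shown are common first guesses for a count and a proportion, not necessarily the ones the workshop settles on.

```sas
/* Count outcome: Poisson distribution with its usual log link (an assumed starting point) */
Proc glimmix data=flowers plots=studentpanel;
  class block trmt;
  model count_flower = trmt / dist=poisson link=log;
  random block;
Run;

/* Proportion strictly between 0 and 1: beta distribution with a logit link (an assumed starting point) */
Proc glimmix data=flowers plots=studentpanel;
  class block trmt;
  model prop_flower = trmt / dist=beta link=logit;
  random block;
Run;
```

Both options sit after the slash on the MODEL statement. PLOTS=STUDENTPANEL keeps the residual plots in the output so you can check whether the chosen distribution looks reasonable.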


SAS Workshop: ANOVAs

ANOVA, or Analysis of Variance, is one of the “classic” or standard statistical analyses that you will complete at one time or another during your research career. The statistical methodology behind ANOVA has changed a great deal over the past 40-50 years, and SAS has done its part by creating new PROCs to match the statistical advances.

This part of the SAS Workshop will start by reviewing the SAS PROCedures that were created and used over the years. This will give you a better sense as to why we are using the newest PROCedure GLIMMIX and will hopefully provide you insight as to why some researchers are still using the older PROCedures GLM or MIXED.

History of ANOVA analyses in SAS

1966 – SAS is released with Proc ANOVA, which is to be used with:

balanced data ONLY!
FIXED effects ONLY!
NOTE from SAS Online Docs: “Caution: If you use PROC ANOVA for analysis of unbalanced data, you must assume responsibility for the validity of the results.”

1976 – SAS released Proc GLM

balanced (Type I SS) and unbalanced (Type III SS)
RANDOM statement introduced – provides EMS (expected mean squares) equations, but you need to do the calculations!

1992 – Proc MIXED

RANDOM statement incorporated
REPEATED statement introduced
“Normally distributed” data ONLY
linear effects

1992 – Proc GENMOD

Non-normal data
Fixed effects ONLY

xxxx? – Proc NLMIXED

normal, binomial, Poisson distributions
nonlinear effects

2005 – Proc GLIMMIX

Proc MIXED
Proc NLMIXED
Non-normal data

Randomized Complete Block Design (RCBD)

We will start by analyzing the data collected from a small RCBD trial. There were 4 blocks, and 6 treatments were randomly assigned within each block. To run these analyses, please copy and paste the following code into your SAS program. You may need to make some edits after you paste it into your program.

Data rcbd;
input block trmt Nitrogen;
datalines;
1 1 34.98
1 2 40.89
1 3 42.07
1 4 37.18
1 5 37.99
1 6 34.89
2 1 41.22
2 2 46.69
2 3 49.42
2 4 45.85
2 5 41.99
2 6 50.15
3 1 36.94
3 2 46.65
3 3 52.68
3 4 40.23
3 5 37.61
3 6 44.57
4 1 39.97
4 2 41.9
4 3 42.91
4 4 39.2
4 5 40.45
4 6 43.29
;
Run;

Run this DATA step and use a Proc Print to ensure that the data were read correctly.
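A minimal check might look like the following. The Proc FREQ step is an optional extra, not part of the original workshop code; it confirms that every block and treatment combination appears exactly once.

```sas
/* List the data to compare against the datalines above */
Proc print data=rcbd;
Run;

/* Optional extra check: 4 blocks x 6 treatments should show a count of 1 in every cell */
Proc freq data=rcbd;
  tables block*trmt / norow nocol nopercent;
Run;
```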

Once we have the data in our SAS program, let’s start with PROC GLM:

/* Proc GLM Statements */
Proc glm data=rcbd;
class block trmt;
model Nitrogen = block trmt;
random block;
title "Proc GLM Results";
Run;
Quit;

Here is a PDF copy of the output created by the above code. I will review the output and the code that is used to generate it during the workshop.

Let’s move onto PROC MIXED:

/* Proc MIXED Statements */
Proc mixed data=rcbd;
class block trmt;
model Nitrogen = trmt;
random block;
title "Proc MIXED Results";
Run;

Here is a PDF copy of the output created by the above code. I will review the output and the code that is used to generate it during the workshop.

Now let’s do the same analysis for a third time using PROC GLIMMIX. The code is:

/* Proc GLIMMIX Statements */
Proc glimmix data=rcbd;
class block trmt;
model Nitrogen = trmt;
random block;
title "Proc GLIMMIX Results";
Run;

Here is a PDF copy of the output created by the above code. I will review the output and the code that is used to generate it during the workshop.
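To compare the treatment means, an LSMEANS statement can be added to the same model. The sketch below is one way to do it. The Tukey adjustment is my own assumption about how you might want to handle multiple comparisons, so this output will differ from the PDF linked above.

```sas
/* Same RCBD model, with treatment least squares means and pairwise differences */
Proc glimmix data=rcbd;
  class block trmt;
  model Nitrogen = trmt;
  random block;
  lsmeans trmt / pdiff adjust=tukey;   /* all pairwise differences, Tukey-adjusted p-values */
  title "Proc GLIMMIX Results with LSMEANS";
Run;

title;   /* clear the title so it does not carry over to later output */
```

The same LSMEANS statement also works in Proc MIXED, which is part of why the move between the two PROCs is straightforward.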

So… if you have used PROC MIXED in the past, moving to GLIMMIX is easy and highly recommended!


SAS Workshop: Getting Comfortable with your data

PDF version of the workshop notes

Before we start any statistical analysis, we should really take a step back and get familiar and comfortable with our data by “playing” around with it to ensure that we know what’s in there. This may sound funny, but getting comfortable with your data by running descriptive statistics really does two things: first, you understand what’s been collected and how; and second, you get the opportunity to review the data and find any errors in it. Sometimes you may find an extra 1 added to the front of a number, or maybe a 6 instead of a 9, or any combination of data entry errors. By playing around with your data and getting comfortable with it before running your analysis, you may find some of these anomalies.

For this workshop, I will provide you with a starting SAS program, which you can download here. You will be asked to type in the PROCs as we work through them, but if you would rather, you always have the option of copying them from this post and pasting them into your SAS editor or code window. Please note that there may be some nuances when you copy and paste. Any curly quotation marks (“ or ”) will need to be replaced with straight quotation marks (") in your SAS program!

My goals for this session are to review the following PROCedures:

Proc Contents
Proc Univariate
Proc Freq
Proc Means

PROC CONTENTS

PROC CONTENTS provides you with the background information on your dataset. One of the challenges in working with SAS is that you do not have your dataset in front of you all the time. You read it in and it gets sucked into what I call the “Blackbox of SAS”. Sometimes we want to see the data itself, to ensure it’s still there or simply to be comforted by the sight of it (we use PROC PRINT for that). Other times we want to see the contents of the dataset: the formats of the variables and information about the dataset itself.

To do this we need to run a Proc CONTENTS on our file. This is the equivalent of the Variable View in SPSS.

Proc contents data=woodchips;
Run;

What information were you able to see? Information about the actual SAS datafile along with formatting information about the variables contained in the datafile. View the output here as a PDF.

If you make changes to the variables along the way, or if you add labels, rerun the Proc CONTENTS to ensure the changes were applied.

PROC UNIVARIATE

Proc UNIVARIATE will be familiar to many of you as the PROC we use to see whether our data is normally distributed or not. This is one use for this PROCedure, but it is also very handy to get a sense for your data. It is one PROC that isn’t used to its full capability, in my opinion.

Let’s try running it as follows:

Proc univariate data=woodchips;
var weight;
Run;

Here is a link to the output saved as a PDF file.

As you review the output you can see the variety of descriptive statistics that this PROC provides you. You should now have a very good feel for the data we are working with.
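Since many of us reach for this PROC to check normality, here is a sketch of how that request might look. It assumes the same woodchips dataset and the weight variable used in the Proc UNIVARIATE call above.

```sas
/* NORMAL on the PROC statement adds the tests for normality to the output */
Proc univariate data=woodchips normal;
  var weight;
  histogram weight / normal;                 /* histogram with a fitted normal curve */
  qqplot weight / normal(mu=est sigma=est);  /* Q-Q plot against a normal with estimated mean and SD */
Run;
```

The plots are often more informative than the tests alone, especially with small datasets.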

PROC FREQ

Proc FREQ is used to create frequencies and cross-tabulations. In our dataset we only have one categorical variable, quality. To create a frequency table use the following code:

Proc freq data=woodchips;
table quality;
Run;

Here is the link to the output saved as a PDF file.

Should you run a Proc FREQ on a variable such as weight? Why or why not?

PROC MEANS

Proc MEANS is a fabulous and very versatile Proc to get a sense of your continuous variables, weight, in our example. Let’s start with the overall mean by using this code:

Proc means data=woodchips;
var wood_weight;
Run;

Here is the link to the output saved as a PDF file.

Note the default measures – N, Mean, StdDev, Min, Max

To request other descriptive measures, list them on the Proc MEANS statement. Note that once you list any statistics, only the ones you list are printed. For example, we want the mean, the standard error, and the sum:

Proc means data=woodchips mean stderr sum;
var wood_weight;
Run;

Here is the link to the output saved as a PDF file.

One last piece of code for Proc MEANS: We want to see the means for each quality group.

Proc means data=woodchips;
class quality;
var wood_weight;
Run;

Here is the link to the output saved as a PDF file.

For more ways to use Proc MEANS, visit the following blog entry on SASsyFridays:

SAS Workshop: Merging files and Creating new Variables

PDF version of the workshop notes

This workshop will walk through merging datasets and creating new variables in SAS. You can choose to do all these steps in Excel and many people do. But sometimes it can be easier to do it all in SAS. Use Excel to enter your data, but then let SAS do the data manipulations and analysis. One of SAS’ strengths is in data manipulation!

Let’s start by bringing the data into SAS. Here is a link to an Excel file called Wksp2_data. Download the file and save it on your computer. There are 4 worksheets in this file. I will leave it up to you as to how you bring the data into SAS, but you will need to bring all 4 sheets in as 4 different SAS datasets. Let’s call the January worksheet – January, the February one February, and so on.

End goal: 4 SAS datasets: January, February, March, April

Review your LOG window after you bring each one in to make sure you haven’t missed anything. Then run a Proc Print to ensure the data looks like you were expecting it – in other words, it should look like the Excel worksheets.
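If you would like to use code for the import, Proc IMPORT is one option. The sketch below makes several assumptions: the folder path is hypothetical, the file is assumed to have an .xlsx extension, and DBMS=XLSX requires the SAS/ACCESS Interface to PC Files to be licensed on your machine.

```sas
/* Hypothetical path and assumed .xlsx extension: change to wherever you saved the file */
%let wkbook = C:\SASworkshop\Wksp2_data.xlsx;

/* Read the January worksheet into a SAS dataset called January */
Proc import datafile="&wkbook" out=January dbms=xlsx replace;
  sheet="January";
  getnames=yes;     /* first row of the sheet holds the variable names */
Run;

/* Check that the data look like the Excel worksheet */
Proc print data=January;
Run;
```

Repeat the Proc IMPORT step for February, March, and April, changing SHEET= and OUT= each time.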

Best Practice Note: Add a comment line before each Proc to describe what you’re doing! This way you’ll remember when you go back to review your syntax.

Merging Datasets

When you think about merging datasets, there are 2 ways that you might want to merge: across or down.

Across or adding Variables

We have 3 months of data where the IDs are the same: January, February, and March. We want to take the weights from February and March and add them to the end of the January file. So we’re adding variables in this case.

To accomplish this we need to take 2 steps. The first is to sort each dataset to make sure they are in the same order.

Proc sort data=January;
by ID;
Run;

Proc sort data=February;
by ID;
Run;

Proc sort data=March;
by ID;
Run;

Step Two:

Create a new Dataset for the merged data. Remember that the Data statement saves the data using the name you give it – so let’s call it jan_mar – for January to March.

Data jan_mar;

We’re then going to tell SAS that we want to merge the 3 datasets and we want to merge them by ID.

Data jan_mar;
merge january february march;
by ID;
Run;

What does your LOG say? Is it right? What is the best way to make sure SAS has done what you wanted it to do?

What happened? How should we fix this?

We had 3 files that had the same variable names in each. So merging them by adding variables didn’t really work: we were not adding new variables, we were replacing the contents of the variables month and weight. To fix this, we need to call the variables something different in each of the months. I will add _jan, _feb, _mar, and _apr to the weight variable in each dataset (weight_jan, weight_feb, and so on). I will also change the variable month to month_1, month_2, etc…

Make these changes and rerun. Did it work this time?
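One way to make these changes without editing the Excel file is the RENAME= dataset option on the MERGE statement. This is only a sketch, and it assumes each monthly dataset contains exactly the variables ID, month, and weight, and has already been sorted by ID.

```sas
/* Rename month and weight as each dataset is read, so nothing gets overwritten */
Data jan_mar;
  merge january (rename=(weight=weight_jan month=month_1))
        february(rename=(weight=weight_feb month=month_2))
        march   (rename=(weight=weight_mar month=month_3));
  by ID;
Run;

Proc print data=jan_mar;
Run;
```

The original January, February, and March datasets are left untouched. Only the variables in jan_mar carry the new names.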

You should now also see where there are missing data, something that was not apparent the first time we ran this.

Adding Observations or Down

We have a 4th dataset that contains weight measurements taken in January and April, but you’ll notice that these belong to individuals who were not included in the first 3 data files of the trial. So we will need to add these to the bottom of the dataset currently called jan_mar.

We already have a file called April that contains the April data and now we have one called jan_mar that contains the merged data from January to March. Since these individuals were not included in the original data, we do not need to sort them, since ID plays no role here.

To add the data to our merged dataset we use the SET statement rather than the MERGE statement:

Data jan_apr;
set jan_mar april;
Run;

Proc print data=jan_apr;
Run;

Creating New Variables

We now have all the weight data for the individuals in our trial and would like to calculate the individual weight gains from January to March. As noted before we can calculate these in Excel, but let’s use SAS to do it, especially since we have a new merged dataset.

In order to calculate the weight gain, we will be touching the data, and whenever we touch the data, we need to work within a DATA step. So let’s create a new dataset and call it jan_apr_wtgain.

Since we will be using a dataset that is already available to us in SAS, we use the SET statement again to recall the jan_apr dataset.

Data jan_apr_wtgain;
set jan_apr;

Now we can create our new variable. Let’s call it wtgain and it will be the difference between the weight taken in January (weight_jan) and the weight taken in March (weight_mar).

Data jan_apr_wtgain;
set jan_apr;

wtgain = weight_mar - weight_jan;
Run;

Run a Proc Print to check your data. Did it work?

Recoding a variable

Sometimes we have a variable that we want to recode – so in our case we are going to create a new variable called wtclass that will take the weights measured in January and put them into 3 weight classes: 1 = 13-16; 2 = 17-20; 3 = 21-24

There are a number of different ways to accomplish this. I will post one here and, depending on time in the workshop, I will show you others. I will also include the other options in the accompanying SAS syntax – to be posted after the workshop.

We are working with data again, so we need to work within a DATA step. Let’s use the jan_apr_wtgain dataset.

Data jan_apr_wtgain;
set jan_apr;

wtgain = weight_mar - weight_jan;

if weight_jan < 17 then wtclass = 1;
if weight_jan ge 17 and weight_jan < 21 then wtclass = 2;
if weight_jan >20 then wtclass = 3;
Run;

Run a Proc Print and see what happened. Are we happy with these results?
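One thing to look for in the printout is how observations with a missing January weight were classified. SAS treats a missing value as smaller than any number, so a test like weight_jan < 17 is true for them. The sketch below is one possible alternative, not necessarily the version shown in the workshop. The dataset name jan_apr_wtclass is hypothetical.

```sas
Data jan_apr_wtclass;   /* hypothetical dataset name */
  set jan_apr;

  wtgain = weight_mar - weight_jan;

  /* Test for missing first, then let IF/ELSE IF assign exactly one class */
  if missing(weight_jan) then wtclass = .;
  else if weight_jan < 17 then wtclass = 1;
  else if weight_jan < 21 then wtclass = 2;
  else wtclass = 3;
Run;

/* Count how many observations fall in each class, including missing */
Proc freq data=jan_apr_wtclass;
  tables wtclass / missing;
Run;
```

With ELSE IF, SAS stops checking as soon as one condition is true, so a later statement cannot overwrite an earlier class. As in the original code, any weight above 24 still ends up in class 3.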

What can SAS do?

There are so many more manipulations that SAS can do. These are just a couple of them, chosen because they may help you out as you start using SAS for your own research data. Check out others that have been discussed in the past on the SASsyFridays blog.