R vs SAS Series: Getting the data ready – ANOVA

Continuing on from our last blog post R vs SAS Series: Statistical Models Review – ANOVA, let’s take a look at how we need to get the data ready for our analysis.

Let’s review our statistical model.

$$\text{Nitrate}_{ij} = \mu + \text{trmt}_i + e_{ij}$$

where:

$\text{Nitrate}_{ij}$ = Stem nitrate amount of the jth observation in the ith trmt
$\mu$ = Overall mean or model intercept
$\text{trmt}_i$ = the effect of the ith treatment group
$e_{ij}$ = random error or experimental error

This means that in order to run our analysis, we need to have stem nitrate measures and information about our treatments.  Specifically, we need to have in our dataset a column with the nitrate measures and a second column that tells us which treatment each nitrate measure was on.  You may also have a column that is an identifier – in this case Plot_ID which helps me to identify which plot the measurements were taken from.  A sample data table or Excel file may look like this:

Plot_ID Treatment Nitrate
101 1 34.98
102 2 40.89
103 3 42.07
124 6 43.29
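
For readers following along in R, here is a minimal sketch of the same layout as a data frame. It only types in the four rows shown above (the full dataset has 24 plots), and the object name `dat` is my own choice for illustration.

```r
# The four sample rows from the table above, typed in by hand.
# 'dat' is an illustrative name, not something the analysis requires.
dat <- data.frame(
  Plot_ID   = c(101, 102, 103, 124),
  Treatment = c(1, 2, 3, 6),
  Nitrate   = c(34.98, 40.89, 42.07, 43.29)
)

# Check which type R assigned to each column
str(dat)
```

Note that R stores Treatment as a plain number at this point. That matters for the analysis, as discussed further down.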

Fixed vs Random Effects

Now we need to do a little bit of background work.  We’ve all heard of FIXED and RANDOM effects.  These should be driven by your statistical model!  In the example we are currently working with, we only have one effect:  Treatment.  Is it a FIXED or is it a RANDOM effect?

Let’s go back and look at some definitions and examples of these 2 terms.

Fixed Effects

Fixed effects are something you want to study – you set out the levels that you are interested in. You “fix” the levels. The results from your experiment can only talk about the levels you studied.

  • Example #1: I want to see whether 1st year students prefer Coke or Pepsi
  • Example #2: I want to see the effect of 3 levels of fertilizer on my crop

Random Effects

Random effects are factors in your design that may contribute variation in your outcome measure, but you are not interested in it. You only want to account for it, before looking at your treatment effects.

  • Example: I want to study the effect of fertilizer on my crop. The Block effect, Weather, etc… may add variation to my results, but they are not what I set out to study

Back to our example – what do you think our Treatment effect is?  If you said FIXED – you are correct!

Alrighty – so Treatment is a FIXED effect.  In our dataset, we entered the Treatment levels as 1, 2, 3, 4, 5, or 6 – in other words, we used numbers.  We could have used letters / alphanumeric / strings – doesn’t matter.  However, because we used numbers, we need to let our programs know that these values are labels, not quantities that we will average or do arithmetic on.  They are to be used as a grouping or classification variable, also called a factor: something that tells us and the program which treatment each of our nitrate values comes from.

In SAS – we can do this very simply by including the Treatment variable in a CLASS statement.  However, in R, we need to change the format of the variable to a factor.  To do this we need to use the following R script:

Treatment <- as.factor(Treatment)
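
If your data sits in a data frame rather than a stand-alone vector, the same conversion is applied to the column. The sketch below assumes a data frame called `dat` (a name I made up) holding the four sample rows from the table earlier in this post.

```r
# Hypothetical data frame 'dat' with the four sample rows from the table above
dat <- data.frame(
  Plot_ID   = c(101, 102, 103, 124),
  Treatment = c(1, 2, 3, 6),
  Nitrate   = c(34.98, 40.89, 42.07, 43.29)
)

is.factor(dat$Treatment)   # FALSE: R read the treatment codes as numbers

# Convert the column in place so R treats it as a grouping variable
dat$Treatment <- as.factor(dat$Treatment)

is.factor(dat$Treatment)   # TRUE
levels(dat$Treatment)      # the treatment labels present in these rows
```

Checking with is.factor() and levels() before running any model is a cheap way to confirm that R will treat Treatment as groups rather than as a number.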

We’ll see how this fits in with our ANOVA coding in the next blog post.  For now – remember:

  1. We need to determine which of our factors are FIXED or RANDOM
  2. In R, we need to change the format of our factors using the as.factor() function.

Quick Recap

Everything is based on that statistical model – please remember what it is for your trial

Factors in our model may be FIXED or RANDOM

In SAS we can tell the program which variables are factors by listing them in a CLASS statement.

In R, we need to use the as.factor() function to change the format of our factor variables to a factor

Coming up next in this mini series

  1. R vs. SAS Series: Conducting the ANOVA
  2. R vs. SAS Series: Reading the ANOVA outputs
  3. R vs. SAS Series: RCBD – ANOVA
  4. R vs. SAS Series: RCBD – Reading the ANOVA outputs


R vs SAS Series: Statistical Models Review – ANOVA

This will be the first in a mini series of blog posts discussing the differences between R and SAS.  

One of the challenges of supporting the 2 programs (R and SAS) is not the differences in the coding – I think of those as being similar to the differences between spoken languages.  We may be fluent in English, but we may also be able to speak in French or another language.  This is no different than the coding for SAS and R.  I may be fluent in SAS, but now I can also speak or rather code in R.  Same concepts – just a slightly different language.  Many of us who speak more than one language can see the similarities between the languages.  This is also true of the coding languages of SAS and R.  There are many similarities; you just need to learn the specific language nuances.

So, yes, learning the language can be challenging, but I’m finding that the differences in the outputs can be equally challenging.  I remember the first time I saw the output for LMER and thought – where is my ANOVA table???  and my p-values?  Heck SAS gives it all to me – maybe I should stick to SAS.  But the industry and many of my students and researchers are moving to R – for many, many great reasons.  Ok – time to admit that I’m falling in love with R too (shhh…  don’t let SAS know 🙂  ).   So, I think it’s time to dig into the output differences.

Before we get too deep into our example – we need to take a step back and talk about “statistical models”.  Understanding why these are so important and why they are the key to our analyses, will help us better understand the differences we may see between the SAS and R outputs.

Statistical Model

If you’ve taken a workshop or a class with me – you know that I am a firm believer in experimental designs and statistical models.  Once you have a research question, you can design your experiment – with your experimental design, you know what your statistical model is – with your statistical model in hand, you know what data you will be collecting – with all this information, you know what statistical analyses you will be conducting, and how you will be presenting your results.  Phew!  and I haven’t even collected my data yet!  Just for giggles try it out with your current project and let me know how it works out.

The example dataset we will be using for this blog post and the following one is a dataset from the Kuehl textbook (example 8.1:  Design of Experiments:  Statistical Principles of Research Design and Analysis).

We will first be using the data as if it were collected from a CRD or a completely randomized design.  In other words we have 24 experimental units that were randomly assigned to 6 treatment groups.  With this design our statistical model is:

$$\text{Nitrate}_{ij} = \mu + \text{trmt}_i + e_{ij}$$

where:

$\text{Nitrate}_{ij}$ = Stem nitrate amount of the jth observation in the ith trmt
$\mu$ = Overall mean or model intercept
$\text{trmt}_i$ = the effect of the ith treatment group
$e_{ij}$ = random error or experimental error

Let’s just break this down a little bit more before we look at the data.

We have a number of observations where we’ve measured the stem nitrate content of wheat within the experimental unit – a plot in a field.  We have 24 of these plots and we randomly assigned the plots to receive one of 6 treatments.  In an ideal world, each of the 24 plots is identical – but we know that this just isn’t possible.  There may be differences between the plots due to their location in the field, maybe some receive more sun than others do, and we know that the soil in a plot can vary a lot for many reasons.  In the end, we know that there are inherent differences between our plots – but we are confident that they aren’t “THAT” different and that we can safely assume they are similar enough to use in this experiment.  Now, let’s turn our attention to our treatments.  As researchers, we will do our best to ensure that the treatments are applied to our experimental units as similarly as possible.  We know that it is almost impossible to ensure that the treatments are applied identically to all plots!  We do our best though!

Can you see where I’m going with this?

The goal of our experiment is to have experimental units that are as similar as possible and to apply the treatments as similarly as possible, so that when we see any differences at the end of our trial, we are confident that those differences are due to the treatments applied.   However, we know that this isn’t possible – that we have other sources of variation that may come into play – experimental units are not identical, applying the treatments was not perfect, etc…

When we conduct our analysis – note that we are doing an ANOVA – analysis of variance.  Yes – we are looking at the variation in the measures we took – stem nitrate content in this example – and we are analyzing it or better yet – we are partitioning or breaking apart the overall variation we see in our stem nitrate measures into its components.  This is why knowing the statistical model is so important!!

In our CRD – we are partitioning the variation of our nitrate measures into our treatments – nothing else!  However, we recognize that we cannot explain it all and that’s why we will always have that experimental error or random error.  This is the part of the variation in our measure that we just cannot explain with our data – that bit where our experimental units are not identical or that bit where we could not apply our treatments identically.  In other words – if you think of the nitrate measures that were collected from this study and visualize them as a cloud of data – the variation of the measures is the cloud.  The ANOVA will look at this cloud and determine if we can pull apart the different treatments – can we see a clumping of that cloud in one area that represents a treatment?  Or maybe all the treatments overlap and we cannot pull them apart or partition the variation among the different treatment levels.
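
To make the idea of partitioning concrete, here is a small sketch in R. The numbers are made up purely for illustration (they are NOT the Kuehl data), and the names `toy`, `ss_total`, `ss_trt` and `ss_error` are my own. It splits the total variation around the overall mean into a treatment piece and an error piece, and checks that the two pieces add back up to the total.

```r
# Made-up toy values, NOT the Kuehl data: 3 treatments, 2 plots each
toy <- data.frame(
  Treatment = factor(c(1, 1, 2, 2, 3, 3)),
  Nitrate   = c(10, 12, 15, 17, 20, 22)
)

grand_mean <- mean(toy$Nitrate)
# ave() returns, for every plot, the mean of the treatment it belongs to
trt_mean <- ave(toy$Nitrate, toy$Treatment)

ss_total <- sum((toy$Nitrate - grand_mean)^2)  # size of the whole "cloud"
ss_trt   <- sum((trt_mean - grand_mean)^2)     # variation between treatments
ss_error <- sum((toy$Nitrate - trt_mean)^2)    # variation left unexplained

c(total = ss_total, treatment = ss_trt, error = ss_error)

# The two pieces account for all of the variation
all.equal(ss_total, ss_trt + ss_error)
```

In this toy example almost all of the variation sits in the treatment piece, which is the "clumping" described above. When treatments overlap, the error piece dominates instead.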

Our statistical model allows us to look at our data in a couple of ways.  First it helps us identify the different sources of variation in our measures, and second, it also allows us to predict values.  Hmm..  what?  Remember that we always need to check the assumptions of our model – and the assumptions all deal with the model residuals.  How do we define residuals again?  Observed values minus predicted values.  Predicted values come from our statistical model.

Let’s take another  look at our statistical model.

$$\text{Nitrate}_{ij} = \mu + \text{trmt}_i + e_{ij}$$

What it’s saying is:  our stem nitrate measure is made up of the overall mean + the effect of the treatment it was on + some random error.  Once I run my ANOVA, I should be able to tell you what the predicted value of an observation on any given treatment should be.  In other words, I could break up the measure I took for stem nitrate from my trial, and tell you what the overall mean for the trial was, and how much of the measure was attributed to the treatment it was on.  Cool eh??
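
Here is a rough sketch of that breakdown in R, again with made-up toy numbers rather than the Kuehl data. The object names (`toy`, `fit`, `breakdown`) are illustrative only. It fits the one-way model with aov(), pulls out the predicted values with fitted(), and rebuilds each observation as overall mean + treatment effect + residual. I compute the overall mean and treatment effects by hand here, because the coefficients R prints by default are expressed relative to the first treatment level rather than the overall mean.

```r
# Made-up toy values, NOT the Kuehl data: 3 treatments, 2 plots each
toy <- data.frame(
  Treatment = factor(c(1, 1, 2, 2, 3, 3)),
  Nitrate   = c(10, 12, 15, 17, 20, 22)
)

fit <- aov(Nitrate ~ Treatment, data = toy)

predicted  <- unname(fitted(fit))      # in a CRD: the plot's treatment mean
residual   <- unname(residuals(fit))   # observed minus predicted
mu         <- mean(toy$Nitrate)        # overall mean (balanced toy data)
trt_effect <- predicted - mu           # how far each treatment sits from mu

# One row per plot: the pieces that add up to the observed value
breakdown <- data.frame(
  Treatment = toy$Treatment,
  observed  = toy$Nitrate,
  mu        = mu,
  trt_effect,
  residual
)
breakdown

# Overall mean + treatment effect + residual gives back each observation
all.equal(breakdown$observed, mu + trt_effect + residual)
```

Each row reads exactly like the model: the observed nitrate value is the overall mean, plus the effect of the treatment the plot was on, plus whatever is left over.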

I hope this all makes sense – as it is important to be comfortable with this as we move along to talking about how SAS and R differ.

Quick recap

Research question -> experimental design

Experimental design -> statistical model

Statistical model ->  ANOVA

ANOVA – partitioning of variation in our outcome measure – stem nitrate amount in this example.  Think of all your data as a cloud – the ANOVA will tell you whether it is able to break apart the cloud (variation) into the treatment groups.

ANOVA – also tells you how much each treatment contributes to the outcome measures.  Predicted values.

Coming up next in this mini series

  1. R vs. SAS Series: Getting the data ready – ANOVA
  2. R vs. SAS Series: Conducting the ANOVA
  3. R vs. SAS Series: Reading the ANOVA outputs
  4. R vs. SAS Series: RCBD – ANOVA
  5. R vs. SAS Series: RCBD – Reading the ANOVA outputs