Non-Gaussian Distributions

One of the most common questions researchers ask today when using Proc GLIMMIX is:  What distribution should I use?  So, let’s take a look at this question from a very practical viewpoint.

First of all – what the heck is “Non-Gaussian”???

Gaussian is a fancy word for Normal.  So if you have normally distributed data, or data that conforms to a Normal distribution, your data is Gaussian, and you don’t have to worry about the DIST= option in PROC GLIMMIX (assuming, of course, that your residuals are normal and all that other good stuff too!).
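One way to look at the “residuals are normal” assumption is to ask GLIMMIX for its residual plots and then test the saved residuals.  The code below is only a sketch: it assumes the crd data set from the CRD workshop further down this page has already been created, and the names crd_resid, r and sr are made up for this example.

```sas
/* Assumes the crd data set (ID, trmt, weight) from the CRD workshop on this page */
ods graphics on;

proc glimmix data=crd plots=studentpanel;
  class trmt;
  model weight = trmt;                      /* no DIST= option, so Gaussian is used */
  output out=crd_resid resid=r student=sr;  /* crd_resid, r, sr: names chosen for this sketch */
  title "Residual check for the CRD example";
run;

/* Normality tests and a normal Q-Q plot of the raw residuals */
proc univariate data=crd_resid normal;
  var r;
  qqplot r / normal(mu=est sigma=est);
run;
```

If the panel of studentized residuals and the Q-Q plot look reasonable, staying with the default Gaussian distribution is usually a fair starting point.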

But if you’ve collected data that is categorical in nature, a count, a proportion, or a time to event, then you’ve got non-normal or non-Gaussian data, and YES! you have to figure out what distribution is appropriate when running your model using PROC GLIMMIX.

Based on the paper by Walter W. Stroup (2015), “Rethinking the Analysis of Non-Normal Data in Plant and Soil Science” (Agronomy Journal 107(2): 811-827), I’m recommending that you use examples as your starting place.  My goal here is to keep adding examples to this post, so that it can serve as a guide to help you determine the appropriate distribution.  Remember!  When you use the DIST= option in your SAS PROC GLIMMIX code, make sure that you are using the appropriate LINK= option as well.  These two go hand-in-hand.

Below are brief descriptions of some of the distributions currently available through PROC GLIMMIX, with examples listed below each.  As you work with GLIMMIX and the different distributions, please pass on your examples to me, so that I can add them below.  Also, note that there are relationships between and among many of the distributions, and that is why you will see people comparing distributions for the best fit.  Please use this as a guide to help you determine a starting point for your analysis.

Binomial Distribution

A binomial distribution is one that we are all familiar with, believe it or not!  Remember way back in introductory statistics class, the flip the coin exercise?  Yup, that’s a binomial distribution.  There are only 2 possibilities, and you are looking at the proportion of “individuals” that result in one category or the other.  Remember p and q?  p = the proportion of individuals in one category and q = 1-p, the proportion of individuals in the other category.

Does the data that you’ve collected fall under this distribution?

Examples:

  • % of seeds that germinate – seeds germinate or not
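As a rough sketch of how the germination example might be coded, the block below uses the events/trials form of the MODEL statement.  The data set germ, its variables (block, trmt, germinated, planted) and all of the numbers are made up purely for illustration, and a blocked design is assumed.

```sas
/* Made-up illustration data: seeds germinated out of seeds planted per plot */
data germ;
  input block trmt $ germinated planted;
  datalines;
1 A 42 50
1 B 35 50
2 A 45 50
2 B 31 50
3 A 40 50
3 B 38 50
4 A 44 50
4 B 33 50
;
run;

proc glimmix data=germ;
  class block trmt;
  /* events/trials syntax: number germinated out of number planted */
  model germinated/planted = trmt / dist=binomial link=logit;
  random block;
  lsmeans trmt / ilink;   /* ILINK adds the means on the proportion scale */
  title "Binomial sketch: seed germination";
run;
```

Note that DIST=BINOMIAL and LINK=LOGIT are stated together here, in keeping with the advice above that the two options go hand-in-hand.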

 

Poisson or Negative Binomial Distribution

When we think about count data we tend to think of a Poisson distribution.  The Poisson distribution is often used where we count events that occur randomly over space or time.  There is usually no fixed number of trials or events, so we don’t know in advance whether we will count an event 5 times, twice, or even 30 times.

One key attribute of the Poisson distribution is that the variance is equal to the mean.  So when we have a high mean value (µ = 100), we expect to see more variability in our sample counts, and with a low mean value (µ = 5) we expect to see little variability.  There are times when the variance exceeds the mean; this is referred to as Overdispersion.  Essentially, the Poisson distribution is not doing a great job of describing the variation in our measure.

This situation is where the Negative Binomial distribution, an extension of the Poisson distribution, comes in handy.  The Negative Binomial distribution has an extra scale parameter that lets the variance be larger than the mean, so it can account for the additional random variation around the mean.

If you are collecting count data for your project, you have 2 options as a starting point:  the Poisson distribution if your variance is about equal to your mean, and the Negative Binomial distribution if your variance is greater than your mean.

Examples:

  • Weed count per plot
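One possible way to compare the two options is to fit the same model twice and look at the fit statistics.  This is a sketch only: it assumes the trial data set (trmt, block, count) from the Count example further down this page has been created, and it assumes block is treated as a random effect.  METHOD=LAPLACE is used so that the fit statistics of the two runs are based on the likelihood and can be compared.

```sas
/* Assumes the trial data set (trmt, block, count) from the Count example on this page */
proc glimmix data=trial method=laplace;
  class block trmt;
  model count = trmt / dist=poisson link=log;
  random block;
  title "Count data: Poisson";
run;

proc glimmix data=trial method=laplace;
  class block trmt;
  model count = trmt / dist=negbin link=log;
  random block;
  title "Count data: Negative Binomial";
run;
```

In the Poisson run, a Pearson Chi-Square / DF value well above 1 in the conditional fit statistics is commonly read as a sign of overdispersion, and a smaller AIC for the Negative Binomial run would point the same way.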

 

Exponential or Gamma Distribution

The Exponential distribution is a model for a positive, continuous measure such as the time until an event occurs; the basic form assumes that the rate at which the event occurs stays constant over time.  Survival times are a great example of data that may follow an exponential distribution.  The Exponential distribution is a special form of the Gamma distribution (a Gamma with a shape parameter of 1).  The Gamma distribution allows you to model the shape of the curve (is there a peak in the curve, and how skewed is it), and it allows you to model the scale of the curve, which is almost like the spread of the observations.  Both of these distributions are continuous and are characterized by Probability Density Functions.

Examples:

  • Time to flowering
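A time to flowering measure might be coded along the lines of the sketch below.  The data set flower, its variables (block, trmt, days) and the values are invented for illustration only, and the log link is an assumption chosen so that the means stay positive.

```sas
/* Made-up illustration data: days to flowering per plot */
data flower;
  input block trmt $ days;
  datalines;
1 A 41.5
1 B 48.0
2 A 39.0
2 B 52.5
3 A 44.0
3 B 47.5
4 A 40.5
4 B 55.0
;
run;

proc glimmix data=flower;
  class block trmt;
  model days = trmt / dist=gamma link=log;
  random block;
  lsmeans trmt / ilink;   /* ILINK reports the means back in days */
  title "Gamma sketch: time to flowering";
run;
```

Swapping DIST=GAMMA for DIST=EXPONENTIAL would fit the special case described above, which is one way to see how the two distributions relate.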

 

Multinomial Distribution

The Multinomial Distribution covers the situation where you have more than 2 possible outcomes.  A great example of this is a rating scale.  Each observation falls into exactly one of several groups, and each group has its own probability (the probabilities do not have to be equal).  If you think back to your types of data, this would be nominal or ordinal data.  You may also look at this data from a count perspective, and there is a strong relationship between the Poisson distribution and the Multinomial distribution, so you need to think about what your measure is to determine the most appropriate starting distribution.

Examples:

  • Disease Rating Category
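For an ordered rating, a cumulative logit model is one common starting point.  The sketch below is an illustration under stated assumptions: the data set disease, its variables (trmt, rating) and the ratings themselves are made up, and a simple design with no blocking is assumed to keep the example small.

```sas
/* Made-up illustration data: disease rating on an ordered 1 to 4 scale */
data disease;
  input trmt $ rating;
  datalines;
A 1
A 2
A 2
A 3
A 1
A 3
B 2
B 3
B 3
B 4
B 2
B 4
;
run;

proc glimmix data=disease;
  class trmt;
  /* CUMLOGIT treats the rating categories as ordered */
  model rating = trmt / dist=multinomial link=cumlogit solution;
  title "Multinomial sketch: disease rating";
run;
```

If the categories had no natural order (nominal data), LINK=GLOGIT would be the link to consider instead.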

 

Beta Distribution

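The Beta Distribution is another example of a distribution described by a Probability Density Function, similar to the Exponential and Gamma distributions.  The Beta distribution is used for a continuous proportion: a measure that falls between 0 and 1 (for example, a percentage divided by 100), rather than a count of successes out of a number of trials.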

Examples:

  • Proportion of leaf area affected
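A continuous proportion such as leaf area affected might be coded as in the sketch below.  The data set leaf, its variables (block, trmt, prop) and the values are invented for illustration, and the proportions are assumed to lie strictly between 0 and 1.

```sas
/* Made-up illustration data: proportion of leaf area affected (between 0 and 1) */
data leaf;
  input block trmt $ prop;
  datalines;
1 A 0.12
1 B 0.35
2 A 0.08
2 B 0.41
3 A 0.15
3 B 0.29
4 A 0.10
4 B 0.38
;
run;

proc glimmix data=leaf;
  class block trmt;
  model prop = trmt / dist=beta link=logit;
  random block;
  lsmeans trmt / ilink;   /* ILINK reports the means back on the proportion scale */
  title "Beta sketch: proportion of leaf area affected";
run;
```

Values of exactly 0 or 1 fall outside the range of the Beta distribution, so data with such values would need some thought before using this approach.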

 

 

Ridgetown Workshop – August 2, 2017

SAS program files:

Data for balanced RCBD example

Complete SAS program for balanced RCBD example

Data rcbd;
input block trmt Nitrogen;
datalines;
1 1 34.98
1 2 40.89
1 3 42.07
1 4 37.18
1 5 37.99
1 6 34.89
2 1 41.22
2 2 46.69
2 3 49.42
2 4 45.85
2 5 41.99
2 6 50.15
3 1 36.94
3 2 46.65
3 3 52.68
3 4 40.23
3 5 37.61
3 6 44.57
4 1 39.97
4 2 41.9
4 3 42.91
4 4 39.2
4 5 40.45
4 6 43.29
;
Run;

Data for unbalanced RCBD example

Complete SAS program for unbalanced RCBD example

Data rcbd_unb;
input block trmt Nitrogen;
datalines;
1 1 34.98
1 2 40.89
1 3 42.07
1 4 37.18
1 5 37.99
1 6 34.89
2 1 41.22
2 2 46.69
2 3 49.42
2 4 45.85
2 5 41.99
2 6 50.15
3 2 46.65
3 3 52.68
3 4 40.23
3 5 37.61
3 6 44.57
4 1 39.97
4 3 42.91
4 4 39.2
4 5 40.45
4 6 43.29
;
Run;
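With missing plots like these, one way the analysis might be coded in GLIMMIX is sketched below.  This is not the workshop’s own program (that is in the linked file); it assumes block is a random effect, and the Kenward-Roger degrees of freedom and the Tukey adjustment are choices made for this illustration.

```sas
/* Uses the rcbd_unb data set created above */
proc glimmix data=rcbd_unb;
  class block trmt;
  model Nitrogen = trmt / ddfm=kenwardroger;  /* df adjustment often used with unbalanced data */
  random block;
  lsmeans trmt / pdiff adjust=tukey;          /* treatment means and pairwise comparisons */
  title "Unbalanced RCBD: Proc GLIMMIX sketch";
run;
```

Because two treatment-by-block combinations are missing, the least-squares means from LSMEANS can differ from the simple arithmetic treatment means.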

Data for Repeated Measures example

Complete SAS program for Repeated Measures example

Data repeated;
input ID Room trmt day wt;
datalines;
1 1 1 1 13
2 1 1 1 17
3 1 1 1 13
4 1 2 1 16
5 1 2 1 17
6 1 2 1 17
1 2 1 2 22
2 2 1 2 24
3 2 1 2 20
4 2 2 2 23
5 2 2 2 22
6 2 2 2 23
1 3 1 3 36
2 3 1 3 38
3 3 1 3 46
4 3 2 3 45
5 3 2 3 45
6 3 2 3 32
;
Run;
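A repeated measures model for this data set might be sketched as follows.  Again, this is an illustration rather than the workshop’s program: the AR(1) covariance structure is an assumption, and Room is left out of the model because in this data set it takes the same value as day.

```sas
/* Uses the repeated data set created above */
proc glimmix data=repeated;
  class ID trmt day;
  model wt = trmt day trmt*day / ddfm=kenwardroger;
  /* RESIDUAL makes this an R-side (repeated) effect: days within each ID are correlated */
  random day / subject=ID type=ar(1) residual;
  lsmeans trmt*day / slice=day;   /* compare treatments within each day */
  title "Repeated measures: Proc GLIMMIX sketch";
run;
```

SUBJECT=ID works here because each ID appears in only one treatment; if ID numbers were reused across treatments, SUBJECT=ID(trmt) would be needed instead.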

Data for Count example

Complete SAS program for Count data example

Data trial;
input trmt$ block count;
datalines;
A 1 69
A 2 56
A 3 20
A 4 63
B 1 69
B 2 72
B 3 74
B 4 82
C 1 87
C 2 72
C 3 80
C 4 95
D 1 78
D 2 72
D 3 50
D 4 94
;
Run;
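Once a distribution has been chosen for these counts, the treatment means can be reported on the original count scale.  The sketch below is not the workshop’s program; it assumes the Negative Binomial distribution with block as a random effect, and the Tukey adjustment is a choice made for this illustration.

```sas
/* Uses the trial data set created above */
proc glimmix data=trial;
  class block trmt;
  model count = trmt / dist=negbin link=log;
  random block;
  /* ILINK back-transforms the means from the log scale to counts */
  lsmeans trmt / ilink pdiff adjust=tukey;
  title "Count data: treatment means on the count scale";
run;
```

The differences from PDIFF are on the log (link) scale, while the ILINK columns give the means as counts.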

S17 SAS Workshop: Proc GLM, Proc MIXED, Proc GLIMMIX – an overview – RCBD

Notes For the CRD and RCBD Workshop – PDF file

This workshop will look at a Randomized Complete Block Design (RCBD) in Proc GLM, Proc MIXED, and Proc GLIMMIX.  The goal is to review the coding similarities & differences, along with the differences & similarities in the respective outputs.

The SAS program can be found here – please note that it is a PDF file
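As a rough idea of the coding similarities and differences (a sketch, not a copy of the linked program), the three procedures might be coded as below for the balanced RCBD.  It assumes the rcbd data set (block, trmt, Nitrogen) from the Ridgetown workshop section on this page, with block treated as a random effect.

```sas
/* Assumes the rcbd data set (block, trmt, Nitrogen) from earlier on this page */

proc glm data=rcbd;
  class block trmt;
  model Nitrogen = block trmt;   /* in GLM the random effect is also listed in MODEL */
  random block;                  /* prints the expected mean squares */
  title "RCBD: Proc GLM";
run;
quit;

proc mixed data=rcbd;
  class block trmt;
  model Nitrogen = trmt;         /* fixed effects only in MODEL */
  random block;
  title "RCBD: Proc MIXED";
run;

proc glimmix data=rcbd;
  class block trmt;
  model Nitrogen = trmt;         /* same layout as MIXED */
  random block;
  title "RCBD: Proc GLIMMIX";
run;
```

The main coding difference is where block goes: GLM needs it in the MODEL statement, while MIXED and GLIMMIX keep MODEL for the fixed effects and put block in RANDOM.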

Proc GLM Results

Proc MIXED Results

Proc GLIMMIX Results

 

S17 SAS Workshop: Proc GLM, Proc MIXED, Proc GLIMMIX – an overview – CRD

Notes For the CRD and RCBD Workshop – PDF file

The goals of this workshop are:

  • to compare Proc GLM, Proc MIXED, Proc GLIMMIX using a Completely Randomized Design (CRD) for the example by:
    • showing coding differences
    • showing output differences
  • to provide guidelines/explanations as to why and when you would use GLM, MIXED, and GLIMMIX

Proc GLIMMIX appears to be the “new” kid on the block when it comes to analyzing our data.  But believe it or not, GLIMMIX has existed for many years; it just never really caught on until a few years ago.  Many of us are now relearning our traditional analysis methods in SAS and converting to GLIMMIX.

There will be several workshops that will concentrate on the use of Proc GLIMMIX.  The idea is that we will start with the straightforward experimental designs and increase the complexity to showcase the strengths of GLIMMIX and maybe convince you to make the switch to this more robust SAS procedure.  This workshop will use the basic Completely Randomized Design to primarily show coding and output differences among the 3 procedures.

Completely Randomized Design

Our fictitious design has 6 treatments (A, B, C, D, F, G) with 4 observations per treatment. Our Null Hypothesis states that all treatment means are equal, with our Alternate Hypothesis stating that at least 2 means are not equal.  We will have a model to reflect this design as:

Outcome variable(Weight) = overall mean + Treatment effect + residual error

To read in the data we will use a Data Step as follows:

/***************************************************************************/
/* Reading data gathered from a CRD conducted across 6 treatments */
/* Variables are 6 treatments and weight collected in hypothetical units */
/* This is a dummy dataset created for the purposes of a demo and workshop */
/* Created by A.M.Edwards May 23, 2017 */
/***************************************************************************/

Data crd;
input ID trmt$ weight;
datalines;
1 A 41
2 A 24
3 A 33
4 A 38
5 B 24
6 B 21
7 B 16
8 B 43
9 C 46
10 C 33
11 C 14
12 C 19
13 D 32
14 D 38
15 D 15
16 D 17
17 F 31
18 F 15
19 F 36
20 F 46
21 G 28
22 G 40
23 G 37
24 G 39
;
Run;

History of ANOVA analyses in SAS

1966 – SAS is released with Proc ANOVA, which is to be used with:

  • balanced data ONLY!
  • FIXED effects ONLY!
  • NOTE from SAS Online Docs: “Caution: If you use PROC ANOVA for analysis of unbalanced data, you must assume responsibility for the validity of the results.”

1976 – SAS released Proc GLM

  • balanced (Type I SS) and unbalanced (Type III SS)
  • RANDOM statement introduced – provides EMS (expected mean squares) equations, but you need to do the calculations!

1992 – Proc MIXED

  • RANDOM statement incorporated
  • REPEATED statement introduced
  • “Normally distributed” data ONLY
  • linear effects

1992 – Proc GENMOD

  • Non-normal data
  • Fixed effects ONLY

xxxx? – Proc NLMIXED

  • normal, binomial, Poisson distributions
  • nonlinear effects

2005 – Proc GLIMMIX

  • Proc MIXED
  • Proc NLMIXED
  • Non-normal data

Proc GLM – General Linear Model

Proc GLM was the second generation PROCedure developed in SAS to conduct ANOVAs (analysis of variance).  This Proc is still used today for situations where you have a FIXED effects model and a balanced design – same number of observations in each treatment group.

Proc glm data=crd;
  class trmt;
  model weight = trmt;
  title "Proc GLM Results";
Run;
Quit;

Proc glm – calls on the GLM Procedure.  data=crd – specifies the dataset which you want Proc GLM to use.

Class statement – list your classification variables here.  Think of these variables as those that tell you which group your observations fall into.

Model statement – this should be based on your experimental design.  In this case we have a CRD – our dependent variable = independent variable or our fixed effect.

Title statement – another great little habit to start.  Create a title statement for each procedure you use.  This way you will have a title at the top of your output window, and you will never have to guess again what that output was about.  If you want more titles or subtitles, simply type title2 or title3, etc.  You can also use the Footnote statement to add notes to the bottom of your output page.

Run statement finishes the Procedure.

Quit statement will let SAS know that you do not want to add any more information to the Proc GLM.  Proc GLM is one of the SAS Procedures that stays active after the Run statement, waiting for more instructions.  In order to close it out, you will need to add a Quit.

View Proc GLM Results

Proc MIXED

With the increasing use of mixed models (models that include both fixed and random effects), Proc MIXED was developed.  Proc MIXED can also account for unbalanced designs.  Using the same CRD dataset:

Proc mixed data=crd;
  class trmt;
  model weight = trmt;
  title "Proc MIXED Results";
Run;

 

You should obtain the SAME results with both procedures for a basic CRD.  For most straightforward models, Proc GLM and Proc MIXED should yield the same results.

Proc mixed – calls on the MIXED Procedure.  data=crd – specifies the dataset which you want Proc MIXED to use.

Class statement – list your classification variables here.  Think of these variables as those that tell you which group your observations fall into.

Model statement – this should be based on your experimental design.  In this case we have a CRD – our dependent variable = independent variable or our fixed effect.

Run statement finishes the Procedure.

View Proc MIXED Results

Proc GLIMMIX

Proc GLIMMIX does it all!  OK, almost.  For our purposes, Proc GLIMMIX handles the different types of experimental designs that are used in OAC and in the agricultural field.

Proc glimmix data=crd;
  class trmt;
  model weight = trmt;
  title "Proc GLIMMIX Results";
Run;

Proc glimmix – calls on the GLIMMIX Procedure.  data=crd – specifies the dataset which you want Proc GLIMMIX to use.

Class statement – list your classification variables here.  Think of these variables as those that tell you which group your observations fall into.

Model statement – this should be based on your experimental design.  In this case we have a CRD – our dependent variable = independent variable or our fixed effect.

Run statement finishes the Procedure.

View Proc GLIMMIX Results