Data Visualization: The Graph

Before we start on the adventure of creating different visualizations, let’s continue to talk about forms of visualization.  Last time we talked about the Table; now let’s take a similar approach and discuss the Graph, or the Chart.

Why use a Graph / Chart?

First let me note that I will more than likely use the words graph and chart interchangeably.

  • When you want to communicate a trend or a pattern in your data
  • Graphs/Charts are more interesting than a table of numbers – let’s be honest here
  • Your reader will tend to remember a graph / chart more readily than a table

Types of Charts

I have often been asked, “What type of chart should I use to represent my data?”  There are many different types of charts, and which one you choose will always depend on what you are trying to communicate to your reader.  Here is a partial list of the different types of charts:

  • Line chart
  • Bar chart
  • Histogram
  • Scatterplot
  • Pie chart
  • Bubble scatterplot
  • Heat chart
  • Area chart
  • Box and whisker chart
  • Radar chart
  • …..

Anatomy of a Chart

Similar to a table, a chart should be able to stand on its own.  The title, legend, and footnotes should provide the reader with enough information to interpret the graph as it was meant to be read.  There are many textbooks and online documents that provide guidelines on the proper construction of a chart, but I will highlight one.

However, let’s first start with the title:

  • It should be clear and concise
  • Also known as the HEADING – according to the Operations Manual for the Canadian Journal of Plant Science, Canadian Journal of Soil Science, and the Canadian Journal of Animal Science.
  • Capitalize the heading in sentence format with no period at the end
  • Do not indent the second and any subsequent lines
  • No units of measure in the title

In the Designing Science book http://dx.doi.org/10.1016/B978-0-12-385969-3.00008-8 you can find a wonderful diagram on the Anatomy of the Chart in Figure 1.  It highlights the structure of a chart, which may include the following items:

  • y-axis
  • y-axis label
  • x-axis
  • x-axis label
  • key to symbols used in the chart
  • Statistical symbols
  • Major ticks
  • Minor ticks
  • Error bars
  • Symbols on the chart

Depending on the type of chart you are creating, all or some of the above features will be extremely important.

Best Practices for creating a Chart

  • A 2D chart is always better and easier to understand than a 3D chart
  • Background colour – keep it simple – white works great or the colour of your presentation
  • Axes colour – Use the highest contrast colour – black?
  • Data colour – keep group colours consistent – distinct from each other
  • Gridlines – only if you need them – they can be very distracting
  • Font – use sans serif type fonts.  Helvetica is recommended
  • Significance – use *, **, *** – consistent
  • Error/variability – do not clutter your chart
  • Ticks on your scales – use a natural count
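
As a rough illustration of a few of these practices in SAS, the sketch below draws a plain 2D bar chart with error bars and sans serif axis labels.  It assumes the SASHELP.CLASS sample data set that ships with SAS; the axis labels and the choice of standard error bars are my own choices for the example, not a rule.

```sas
/* A minimal sketch: 2D bar chart of mean height by sex.          */
/* SASHELP.CLASS is a sample data set shipped with SAS and is     */
/* used here only for illustration.                               */
proc sgplot data=sashelp.class noautolegend;
  /* Bars show the group mean; error bars show the standard error */
  vbar sex / response=height stat=mean limitstat=stderr;
  /* Sans serif axis labels; no gridlines are requested */
  xaxis label="Sex" labelattrs=(family="Helvetica");
  yaxis label="Mean height (inches)" labelattrs=(family="Helvetica");
run;
```

If Helvetica is not installed on your system, SAS will substitute another font, so check the output before you publish it.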

Selecting a Graph or a Table?

When should you use which one?

  • Remember a table is appealing to the “reader”
  • If you need your audience to remember a number or if you need to highlight a number/value – a table may be best
  • Graphs / charts – may have that impact that tables do not.
  • Each has their place and purpose

Examples of Charts

While reviewing these examples of different types of charts published in journals and on the web – think about the best practices listed above.  Do all of these examples follow them?  Are they easily read?  Do they stand alone?

When you embark on creating your own table and/or chart, think about these guidelines and best practices.

Tackling an analysis using GLIMMIX

So, you have collected some data, you have a few treatments which you’d like to compare, and you want to analyze it all using Proc GLIMMIX.  So how do you start?

My goal is to provide steps to tackle these types of analyses, whether you are working with weed data, or animal data, or yield data.  I suspect I’ll be updating this post as we clarify these steps.

First Step – your experimental design

Ah yes!  Despite popular belief you DO have an experimental design!  Find it or figure it out now before you go any further.  Why?  Because your model depends on this!  Your analysis comes down to your experimental design.

Second Step – build your MODEL statement

You know what your outcome variable is and you know what your experimental design is, which means you know which factors you’ve measured and whether they are fixed or random.  So…  you now know the basis of your MODEL statement and your initial RANDOM statement.
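
As a minimal sketch of what that starting point could look like, assume a randomized complete block design with one fixed treatment factor and a random block effect.  The names first, outcome, block and treatment are placeholders for this example; swap in your own data set and variables.

```sas
/* Starting point for an assumed randomized complete block design. */
/* FIRST, OUTCOME, BLOCK and TREATMENT are placeholder names.      */
proc glimmix data=first;
  class block treatment;
  model outcome = treatment;   /* fixed effect(s) from your design  */
  random block;                /* random effect(s) from your design */
run;
```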

Third Step – expected distribution of your outcome variable

You already know whether your outcome variable comes from a normal distribution or not.  Chances are it does not, but then what is it?  Check out the post on Non-Gaussian Distributions to get an idea of what distribution your outcome variable may follow.  Think of it as the starting point.

Add this distribution and the appropriate LINK to the end of our MODEL statement.
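
For example, if we assume the outcome variable is a count, a Poisson distribution with a log link is a common place to start.  This is only a sketch under that assumption, using the same placeholder names (first, outcome, block, treatment); your own distribution and link may well differ.

```sas
/* Assumption for illustration: OUTCOME is a count, so we start   */
/* with a Poisson distribution and a log link.                    */
proc glimmix data=first;
  class block treatment;
  /* The distribution and link go after the slash on the MODEL statement */
  model outcome = treatment / dist=poisson link=log;
  random block;
run;
```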

Fourth Step – run model and check residuals

Remember that when we run Proc GLIMMIX, we need to check our assumptions: the residuals!  How do they look?  How’s the variation between your fixed effect levels?  Homogeneous or not?  Are the residuals evenly distributed?  Are the residuals normally distributed?
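
One way to get these residual checks is to request diagnostic plots on the PROC GLIMMIX statement.  The sketch below assumes ODS Graphics is available and reuses the placeholder names and the assumed Poisson model; STUDENTPANEL gives a panel of studentized residual plots and BOXPLOT gives box plots of residuals by the classification effects.

```sas
ods graphics on;

/* Placeholder names and an assumed Poisson model, for illustration only */
proc glimmix data=first plots=(studentpanel boxplot);
  class block treatment;
  model outcome = treatment / dist=poisson link=log;
  random block;
run;
```

In the box plots, compare the spread of the boxes across treatment levels; in the panel, look at the histogram and Q-Q plot for normality and at the residual-by-predicted plot for an even scatter.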

Fifth Step – residuals NOT normally distributed

Is there another LINK for the DISTribution that you selected?  If so, please try it.

Sixth Step – fixed treatment effects not homogeneous

Now the fun begins.  To fix this one, we need to add a second RANDOM statement, essentially telling SAS that we need it to use the variation of the individual treatment levels rather than the residual variation.  As an example, a RANDOM statement for a design that has a random block effect would be as follows:

RANDOM _residual_ / subject = block*treatment group=treatment;
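
Placed in a complete procedure, this could look like the sketch below.  It assumes the same placeholder names as before (first, outcome, block, treatment) and, purely for illustration, a normally distributed outcome.

```sas
/* Sketch: separate residual variances for each treatment level. */
proc glimmix data=first;
  class block treatment;
  model outcome = treatment;
  random block;                                   /* random block effect */
  /* GROUP=TREATMENT estimates a residual variance per treatment level */
  random _residual_ / subject=block*treatment group=treatment;
run;
```

The covariance parameter estimates table should then list one residual variance for each treatment level, which you can compare to see how different they really are.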

Seventh Step – try another distribution

Now – we do NOT want you trying ALL the distributions possible – this just doesn’t make sense.  Remember you need to think back to the distribution possibilities for our outcome variable.  Please use the link provided in Step 3 as a guide.  However, one distribution I have discovered works for many situations is the lognormal distribution.  At the end of your model statement you would add / DIST=lognormal LINK=identity.

Another option is to transform the data within the GLIMMIX procedure.  One transformation that researchers like is the arcsine square root transformation, which applies to proportions (values between 0 and 1).  To try this one, please use the following code.

Proc GLIMMIX data=first;
/* arcsine square root transformation of the outcome variable */
trans = arsin(sqrt(outcome));

model trans = …;

Run;

Last Step – results will not always be perfect!

You will do the best that you can when analyzing your data.  But please recognize that you may not be able to meet all the assumptions every time.  Go back, review your data, and review your experimental design to ensure you have the correct Proc GLIMMIX coding.

As I’ve noted earlier, as we continue to learn more about GLIMMIX this post will probably be updated to include and/or refine these steps.

ARCHIVE: Ridgetown – Workshop data – November 6, 2017

We’ll continue to work with the data that Brittany provided me earlier this semester.  I’d like you all to try developing a model for the CONTROL_56D data.  Using GLIMMIX, try to find the best fitting model.

On Monday, November 6, we will work together and discuss how everyone approached this challenge.

Please download the data and program here.  Since I cannot link to .sas files, I have provided you with the PDF file.  You’ll need to copy and paste the contents into a SAS editor and work from there.

Also note that I will be available for one-on-one consultations in the afternoon of Monday, November 6, 2017.  To book a timeslot, please visit:  http://rt_oacstats.youcanbook.me 

Crimes of Statistics: Power

To consider the POWER of your statistical analysis, we need to take a step back and talk briefly about Hypothesis tests and their relationship with POWER.

Remember how you start your research?  With a hypothesis.  For our little example we will have a null hypothesis that says the mean height of cats is equal to the mean height of dogs.  The alternate hypothesis would then say that the mean height of cats is not equal to the mean height of dogs.

Ho: µcats  = µdogs
Ha: µcats  ≠ µdogs

We are using an alpha value of 5%, so we will reject Ho if the p-value from our test is less than 0.05.  We went out to measure 4 cats and 4 dogs and their height measurements (inches) are:
Cats:  11, 13, 11, 14
Dogs:  24, 21, 18, 28

The mean height for cats is 12.3 with a standard deviation of 1.5
The mean height for dogs is 22.8 with a standard deviation of 4.3

I can conduct a t-test and it provides me with a p-value of 0.02.  With data such as this I can also calculate the variation around the mean, such that I have 10.8-13.8 (12.3 ± 1.5) for the cats and 18.5-27.1 (22.8 ± 4.3) for the dogs.  Do the ranges overlap? No.
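
If you would like to run this comparison yourself, a minimal sketch in SAS is shown below.  The data set name heights and the variable names species and height are my own choices for the example.  The p-value you read from the output depends on whether you use the pooled or the Satterthwaite line, so it may not match the value quoted above exactly.

```sas
/* The first sample of 4 cats and 4 dogs, heights in inches */
data heights;
  input species $ height @@;
  datalines;
cats 11 cats 13 cats 11 cats 14
dogs 24 dogs 21 dogs 18 dogs 28
;
run;

/* Two-sample t-test comparing mean height between species */
proc ttest data=heights alpha=0.05;
  class species;
  var height;
run;
```

Replacing the values after DATALINES with a different sample lets you rerun the same test on new measurements.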

What conclusion do we draw?
That we will reject the Null hypothesis and state that dogs are significantly taller than cats, by an average of about 10″.

Sounds great, right?  We did expect that the dogs would be taller than the cats.  So right from the beginning, in this example, our experience and knowledge of cats and dogs told us that the Null hypothesis was false, and our little sample backed it up!

Let’s review this table – in our case we were working with a Ho that we knew to be false and we rejected the Ho – so we have NO ERROR.

                              Ho is TRUE              Ho is FALSE
REJECT the NULL Hypothesis    Type I error (ALPHA)    No error (POWER = 1-BETA)
ACCEPT the NULL Hypothesis    No error (1-ALPHA)      Type II error (BETA)

We’re going to repeat this experiment and measure another 8 animals – 4 cats and 4 dogs.

Ho: µcats  = µdogs
Ha: µcats  ≠ µdogs

We are again using an alpha value of 5%, so we will reject Ho if our p-value is less than 0.05.  We have height measurements (inches) of 4 cats and 4 dogs:
Cats:  21, 13, 11, 14
Dogs:  23, 21, 18, 14

The mean height for cats is 14.8 with a standard deviation of 4.3
The mean height for dogs is 19.0 with a standard deviation of 3.9

I can conduct a t-test and it provides me with a p-value of 0.19.  With data such as this I can calculate the variation around the mean, such that I have 10.5-19.1 (14.8 ± 4.3) for the cats and 15.1-22.9 (19.0 ± 3.9) for the dogs.  Do the ranges overlap?  Yes.

What conclusion do we draw?
That we will NOT reject the Null hypothesis and state that the average height of cats and dogs is the same.

Are we comfortable with this?  If you review the table presented above, we still have a FALSE Ho, and this time around we did NOT reject the Null hypothesis, which means we have committed a Type II or Beta error.

A Type II error is directly related to the POWER of the test.  By definition, the power of a statistical test is the probability that the test will correctly reject the null hypothesis when it is false.

POWER is related to a number of factors:

  • sample size
  • effect size – or the size of the difference between treatment groups
  • variation of our outcome variable
  • level of significance – alpha

Consider our example above: what factors could be changed to increase the POWER of our test and ensure that we won’t see results similar to those from the second time we collected data?

  • Sample size

There are several ways to calculate the POWER of a statistical test.  SAS has 2 PROCs: Proc POWER and Proc GLMPOWER.  Review the SASsy Fridays post on these.  There are many links to online calculators as well.  Please choose one that is defensible.
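
As a small sketch of Proc POWER for a two-sample comparison like our cats and dogs, the code below asks SAS for the power at several sample sizes.  The mean difference of 5 inches and the standard deviation of 4 inches are assumed planning values chosen only for illustration; replace them with values you can justify for your own study.

```sas
/* Power of a two-sample t-test at several sample sizes.        */
/* MEANDIFF and STDDEV are assumed planning values, not results */
/* taken from the examples above.                               */
proc power;
  twosamplemeans test=diff
    alpha     = 0.05
    meandiff  = 5
    stddev    = 4
    npergroup = 4 8 16
    power     = .;     /* the dot asks SAS to solve for power */
run;
```

Setting npergroup = . and supplying a target such as power = 0.8 instead asks SAS to solve for the sample size per group.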

 

 

Data Visualization: The Table

Data Analysis Tasks and Methods

What a great visual to help you decide how to visualize your data.   By studying this “visualization” you can see that tables and graphs are primarily used to summarize data and to find relationships.

Tables vs Graphs

Tables:

  • Verbal representation
  • Read the information in rows or columns

Graphs:

  • Visual representation
  • See patterns or relationships

Neither one is better than the other – they each have their own merits and purposes.  It is up to you as the researcher to decide which is more appropriate for your story.

When would you use a Table?

  • Look up individual values
  • Compare pairs of related values
  • Need precision
  • Multiple sets of values in different measures
  • Show summary and detailed information

When would you use a Graph?

  • Show relationships among and between sets of values by giving them shape
  • Patterns, trends and exceptions are more easily seen rather than read
  • Series of values – seen as a whole

Different types of Tables

  1. Data table
    • Show rows and columns of data
    • Very difficult if not impossible to see any trends or relationships by looking at raw data
  2. Contingency Table – or a Crosstab(ulation) Table
    • Can show the relationship between two variables
    • Variables MUST be categorical!
  3. Summary or Aggregate Tables
    • Show descriptive statistics such as: mean, minimum, maximum, standard deviation, standard error, etc…
    • Can also group these by a categorical variable
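
To make the last two table types concrete, here is a short sketch in SAS.  It assumes the SASHELP.CLASS sample data set that ships with SAS; the choice of variables is only for illustration.

```sas
/* Contingency (crosstab) table of two categorical variables */
proc freq data=sashelp.class;
  tables sex*age;
run;

/* Summary table of descriptive statistics, grouped by a categorical variable */
proc means data=sashelp.class mean min max std stderr;
  class sex;
  var height;
run;
```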

Anatomy of a Table

Remember that a table should stand on its own!!!

Below are the highlights of the anatomy of a table.  If you are publishing, please check with the publication you are submitting to.  The guidelines listed below have been pooled from a number of different sources and are meant to be used as a teaching tool and guide only.

TITLE:  Should be clear and concise
Also known as the HEADING – according to the Operations Manual for the Canadian Journal of Plant Science, Canadian Journal of Soil Science, and the Canadian Journal of Animal Science.

  • Capitalize the heading in sentence format with no period at the end
  • Do not indent the second and any subsequent lines
  • No units of measure in the title

COLUMN TITLES: Visible and concise
Also known as COLUMN HEADINGS 

  • Capitalize only the first word
  • Units of measure in parentheses on the last line of the subheading
  • If several headings share the same UOM, place below the headings, centred

LINES:  to separate different parts of the table

BODY: 

  • headings within the body of the table need to be italicized
  • centre entries under the column headings
  • centre data within the columns on the decimal point, dashes, etc.

FOOTNOTES: used to clarify information in the table and should always appear at the bottom of the table!

  • Footnotes start with the letter a as a superscript
  • Each footnote is on a separate line
  • Asterisk – * to designate statistical significance

Examples of Tables

Data Table

Crosstabulation Table

  • Select Add/Remove Data
  • Under Geography – select all provinces
  • Select Apply at the bottom of the page
  • This will create a Crosstab table of geography vs Quarter/year

Research results table

Designing a table

Here are some items to think about when you are designing a table.  A statistical package may not always provide you with the ideal table 🙂

  1. If you are comparing categories, these should be presented vertically in columns rather than rows.
  2. Row entries of data should not be random – order them by importance or alphabetically.
  3. If you are presenting more than one level of categories, arrange the hierarchy to emphasize the categories you think are most important

Example:

Steer weight                                                   Heifer weight
1981     1991     2001    2011                          1981    1991    2001    2011

Versus

1981                                                      1991
Steer weight     Heifer weight          Steer weight    Heifer weight

Summary

  • Often used to present a lot of data
  • Audience will glaze over the table and may not remember the message behind it.
  • Not recommended to use a table to show patterns, trends, or interactions between values – this may be easier to see and remember by using a more visual object
  • Remember who your audience is!!