RDM – Agricultural Statistics

PowerPoint presentation used during the workshop – June 14, 2017

This workshop was a continuation of the 1st RDM workshop, where we concentrated on creating new datasets with variable names that matched Best Practices for reading your data into a statistical analysis package. This workshop looked at how to handle data that has been passed on to you by another individual, and the challenges that accompany “inheriting” or acquiring data.

There were 2 exercises in this workshop. In the first exercise, you were provided with a small dataset and a number of variables, with a title for each. You were asked to determine what questions you would need answered before you could work with this dataset effectively. Below are links to the 4 datasets along with a sample of questions that you may need to ask.

In the second exercise, you were provided with one of two directories that held a number of files. You were asked to review the directory structure and the file naming conventions, and to propose a new structure that was consistent with the recommended Best Practices presented in the workshop. Below are links to proposed answer keys for both directories.

Group1_folder_directory_answersheet

Group2_folder_directory_answersheet

To reiterate, the primary goal of these workshops is that you take all the recommended Best Practices presented and implement them with your own data.

PowerPoint used during the 20170607 workshop

Commonly Used Statistical Packages

For the purposes of this workshop, the following statistical packages were reviewed:

SAS
SPSS
Stata
R
Matlab

It is recognized that many more packages are available and used in the OAC community. If you have questions regarding packages not included here, please email oacstats@uoguelph.ca.

Commonly Used Statistical Packages: Variable name restrictions and limits

LENGTH OF THE VARIABLE NAME

SAS – 32 characters long
SPSS – 64 bytes long
• 64 characters in English
• 32 characters in Chinese
Stata – 32 characters long
R – 10,000 characters long
Matlab – 63 characters long
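
As a rough illustration of the limits listed above, the following Python sketch reports which packages would reject a name because of its length. The dictionary and function names are made up for this example, and the sketch counts characters, which is an assumption: the SPSS limit is expressed in bytes, so names with non-English characters may hit it sooner.

```python
from __future__ import annotations

# Maximum variable-name lengths as listed above.
# Assumption: lengths are counted in characters, although the SPSS limit is in bytes.
MAX_NAME_LENGTH = {
    "SAS": 32,
    "SPSS": 64,
    "Stata": 32,
    "R": 10_000,
    "Matlab": 63,
}


def packages_rejecting_length(name: str) -> list[str]:
    """Return the packages whose length limit the given name exceeds."""
    return [pkg for pkg, limit in MAX_NAME_LENGTH.items() if len(name) > limit]


if __name__ == "__main__":
    long_name = "average_daily_gain_from_birth_to_weaning"  # 40 characters
    print(packages_rejecting_length("wt28"))     # []
    print(packages_rejecting_length(long_name))  # ['SAS', 'Stata']
```

A name that passes for every package here is no longer than 32 characters, which is the length recommended later on this page.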

1ST CHARACTER OF THE VARIABLE NAME

SAS – 1st character MUST be:
• a letter (English) OR
• an underscore “_”
SPSS – 1st character MUST be:
• a letter OR
• “@”, “#”, “$”
Stata – 1st character MUST be:
• a letter (English) OR
• an underscore “_”
R – 1st character MUST be:
• a letter OR
• a period “.” that is NOT followed by a number
Matlab – 1st character MUST be:
• a letter

BLANKS IN VARIABLE NAMES

SAS – NO Blanks!
SPSS – NO Blanks!
Stata – NO Blanks!
R – NO Blanks!
Matlab – NO Blanks!

SPECIAL CHARACTERS IN VARIABLE NAMES

SAS – NO Special characters with the exception of:
• “_”
SPSS – NO Special characters with the exception of:
• “_”
• “.”
• “@”, “#”, “$”
Stata – NO Special characters with the exception of:
• “_”
R – NO Special characters with the exception of:
• “_”
• “.”
Matlab – NO Special characters with the exception of:
• “_”

CASE IN VARIABLE NAMES

SAS – Mixed case – for presentation only
SPSS – Mixed case – for presentation only
Stata – Case sensitive
R – Case sensitive
Matlab – Case sensitive

NAMES/WORDS TO AVOID IN VARIABLE NAMES

SAS – SAS keywords
SPSS – SPSS reserved words
Stata – Stata reserved names (for example _all, _n, if, in, using, with)
R – R function words
Matlab – Function names

GENERAL NOTES ABOUT VARIABLE NAMES

SAS – Libref names can only be 8 characters long
SPSS –
• #variable – is a scratch variable used in syntax
• $variable – is a system variable
• Do NOT end a variable name with a “.” OR “_”
Stata – NA
R –
• The R community recommends that you develop a naming convention for your data
• Use of “_” is faster (10-20%) than the use of “.”
Matlab – NA

Commonalities across the Statistical Packages – Recommended Best Practices for Excel – Variable Names

LENGTH RECOMMENDATION

Maximum length: 32 characters
Keep the variable names short and use a variable label to provide more information. Remember you need to type these variable names in and you will need to remember them.

1ST CHARACTER OF A VARIABLE NAME

ALWAYS start variable names with a letter

VARIABLE NAMES AND SPECIAL CHARACTERS

Numbers may be used anywhere in the variable name AFTER the first character
The only special character you should use is the underscore “_”
Do NOT use BLANKS – replace blanks with an underscore “_”

CASE

Use lowercase
SAS and SPSS ignore case in variable names, but Stata, R and Matlab are case sensitive.
If you use lowercase as a Best Practice, you won’t need to remember which letters are capitals and which ones are not, whichever package you use.
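
The length, first-character, special-character and case recommendations above can be applied to existing column headings in one pass. The Python sketch below is one possible way to do this and is not a tool used in the workshop. The function name and the “v_” prefix for names that do not start with a letter are arbitrary choices made for this example.

```python
import re

MAX_LENGTH = 32  # recommended maximum length from the section above


def clean_variable_name(raw: str) -> str:
    """Turn a raw column heading into a name that follows the recommendations."""
    name = raw.strip().lower()              # use lowercase
    name = re.sub(r"\s+", "_", name)        # replace blanks with "_"
    name = re.sub(r"[^a-z0-9_]", "", name)  # drop all other special characters
    if not name or not name[0].isalpha():   # always start with a letter
        name = "v_" + name                  # "v_" is an arbitrary, hypothetical prefix
    return name[:MAX_LENGTH]                # keep at most 32 characters


if __name__ == "__main__":
    print(clean_variable_name("Weight (kg) Day 28"))  # weight_kg_day_28
    print(clean_variable_name("28 day weight"))       # v_28_day_weight
```

Two different headings can end up with the same cleaned name, for instance after truncation, so it is worth checking the results for duplicates before using them.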

FAMILIARITY WITH STATISTICAL PACKAGE NOMENCLATURE

As you work with a particular package, you will become familiar with the keywords or reference words that are reserved for the program to use.
As a general rule, keep away from statistical terms as variable names. If you REALLY want to use “mean”, qualify it with your data, for example wt_mean or concentration_mean.
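
To check names that already exist against the recommendations in this section, a small checker along the following lines may help. This is a sketch only: the function name is hypothetical, and the set of statistical terms is a short illustrative list, not the reserved-word list of any package.

```python
from __future__ import annotations

import re

# Illustrative list only; each package has its own keywords and function names.
STATISTICAL_TERMS = {"mean", "sum", "min", "max", "median", "var", "std"}


def check_variable_name(name: str) -> list[str]:
    """Return the recommendations that the given name does not follow."""
    problems = []
    if len(name) > 32:
        problems.append("longer than 32 characters")
    if not re.match(r"[A-Za-z]", name):
        problems.append("does not start with a letter")
    if re.search(r"\s", name):
        problems.append("contains blanks")
    if re.search(r"[^A-Za-z0-9_\s]", name):
        problems.append("contains special characters other than '_'")
    if name != name.lower():
        problems.append("is not all lowercase")
    if name.lower() in STATISTICAL_TERMS:
        problems.append("is a bare statistical term")
    return problems


if __name__ == "__main__":
    print(check_variable_name("wt_mean"))  # []
    print(check_variable_name("mean"))     # ['is a bare statistical term']
```

An empty list means the name follows all of the recommendations checked here. It does not guarantee that the name is free of package-specific reserved words.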

Commonalities across the Statistical Packages – Recommended Best Practices for Excel – Variable Labels

Variable names are often short and may not reflect the contents of the data collected. Trying to create a variable name that is a descriptive summary of the data can be extremely challenging. The recommendation is to create short, concise variable names and to create a descriptive variable label for each variable name.

Variable name: wt28

Variable label: Weight (kg) at 28 days of age

CREATING VARIABLE LABELS – SAS

Data first;
Infile …
Input …

Label wt28 = "Weight (kg) at 28 days of age";
Run;

CREATING VARIABLE LABELS – SPSS

Variable View in the SPSS Data Editor:
- Find the variable called wt28
- In the column called Label, type: Weight (kg) at 28 days of age

Syntax window:

VARIABLE LABELS wt28 "Weight (kg) at 28 days of age".

CREATING VARIABLE LABELS – R

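Base R does not have a dedicated variable label command. Apply the appropriate function for the object you are working with, for instance a data frame or a vector. One common approach is to attach a “label” attribute to the variable. Assuming a data frame called mydata (a placeholder name):

attr(mydata$wt28, "label") <- "Weight (kg) at 28 days of age"

The lapply function can be used to apply labels to several variables at once.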

CREATING VARIABLE LABELS – STATA

label variable wt28 "Weight (kg) at 28 days of age"

CREATING VARIABLE LABELS – MATLAB

For a table called T:

T.Properties.VariableDescriptions{'wt28'} = 'Weight (kg) at 28 days of age';
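
If you keep your variable names and labels together in one place, the label statements shown above for SAS, SPSS and Stata can be generated from that list. The Python sketch below illustrates the idea. The dictionary and function names are hypothetical, and it assumes that labels do not contain double quotes.

```python
# Variable names mapped to their descriptive labels (example from this page).
LABELS = {"wt28": "Weight (kg) at 28 days of age"}


def sas_label(name: str, label: str) -> str:
    """Label statement to place inside a SAS data step."""
    return f'label {name} = "{label}";'


def spss_label(name: str, label: str) -> str:
    """SPSS syntax command for a variable label."""
    return f'VARIABLE LABELS {name} "{label}".'


def stata_label(name: str, label: str) -> str:
    """Stata command for a variable label."""
    return f'label variable {name} "{label}"'


if __name__ == "__main__":
    for name, label in LABELS.items():
        print(sas_label(name, label))
        print(spss_label(name, label))
        print(stata_label(name, label))
```

Keeping the names and labels in a single list means the same descriptions can be reused if you move your data from one package to another.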
