S17 SAS Workshop: Getting comfortable with my data in SAS. Descriptive Statistics

PDF Copy of Online notes – 20170711

Quick Review of reading Data into SAS

Preparing Data

  1. Variable names in the first row – make sure they are appropriate for the statistical software you are using.  For more information check out the Best Practices for Entering your Research Data using Excel
  2. Save your Excel file as a CSV – if you are using the INFILE statement.  Please note for Mac users, you MUST save as MSDOS-CSV!

SAS Studio Users

  1. Upload your CSV file to your SAS Studio
  2. Remember to right-click on the file once it is in My Files to obtain its location for the INFILE statement.

Copying and Pasting from Excel

  1. With smaller datasets this works fine
  2. But you need to remember where your MASTER dataset is!!

Download Excel file for this workshop Dataset

Data tuesday;
  input ID group$ trmt age height eye_colour;
  datalines;
1  a  1  39  137  2
2  a  1  35  140  2

;
Run;

Using an INFILE statement

Data tuesday;

  infile "C:\Users\edwardsm\Documents\Workshops\SAS\Level_I\SASI_2\dataset.csv" dlm="," firstobs=2 missover;
  input ID group$ trmt age height eye_colour;
Run;

Checking your data

Use a Proc Print – to make sure that SAS has read in your data correctly.  ALWAYS read the LOG window.  You will see how many lines of observations are in the file and how many variables were read.  You should also see information about the data you read in.  If you’re using the INFILE statement, you will see characteristics about the file.

Proc print data=tuesday;
Run;

Adding variable labels

Do you know what group, trmt represent?  We can probably guess what age, height, and eye_colour mean, but would you know what units age and height were measured in?  Without a codebook or information, such as labels for the variables and value labels for the variable values, you would be guessing!

In SAS, and with many other statistical programs, you can add both a variable label and value labels.

Whenever you work with the data, you need to be working in a DATA step.  Drawing parallels to Excel, you will need to open a new dataset or excel worksheet, make the changes and then save it.  In SAS, you will create a new DATA Step, make the changes to the variable(s), and save it.

Data tuesday_new;
  set tuesday;        * this tells SAS that you want to use the dataset called tuesday that you created earlier;
label
  group = "Individuals on the trial were randomly assigned to 4 groups"
  trmt = "Treatments were assigned within each group"
  age = "Age of the participant in years"
  height = "Height taken of the participants at the end of the trial, measured in cm"
  eye_colour = "Colour of the participants' eyes";
Run;

To view these changes, try a Proc print – what happens??

Try the following:

Proc Contents data=tuesday_new;
Run;

What do you see?

ProcContents_labels
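
One possible answer to the Proc print question above: by default, Proc print uses the variable names as column headings, so the new labels do not show up.  A minimal sketch, assuming the tuesday_new dataset created above, adds the LABEL option to the Proc print statement so that the labels are displayed instead:

```sas
/* Show variable labels, rather than variable names, as the column headings */
Proc print data=tuesday_new label;
Run;
```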

Adding Value labels

Sometimes you will collect variables that are coded.  Rather than writing Blue eyes, brown eyes, you might provide them with a code such as 1,2, etc…  But how do you remember what code you gave what value?  Writing it down on a piece of paper is fine, but what if you misplace that paper?  Adding value labels to your data is a great way to keep all the information together.

To accomplish this in SAS, it is a 2-step process.  We need to create the codes and their labels first, and then we need to apply these to the variables in the dataset.  This allows you to re-use the labels.

Creating the value labels

Proc format;
  value $groupformat
                a = "Group A – Monday morning"
                b = "Group B – Monday afternoon"
                c = "Group C – Tuesday morning"
                d = "Group D – Tuesday afternoon";

  value trmtformat
               1 = "Treatment 1 – Placebo"
               2 = "Treatment 2 – Vitamin C";
Run;

This creates two SAS formats: one called $groupformat and another called trmtformat.  Think of these as boxes that say a represents Group A – Monday morning, etc.

Applying the value formats to the data

Remember that we are touching the data or making changes to the data, so we need to use a Data Step.  Let’s re-use the one where we added variable labels:

Data tuesday_new;
  set tuesday;       

label
  group = "Individuals on the trial were randomly assigned to 4 groups"
  trmt = "Treatments were assigned within each group"
  age = "Age of the participant in years"
  height = "Height taken of the participants at the end of the trial, measured in cm"
  eye_colour = "Colour of the participants' eyes";

format
  group $groupformat.
  trmt trmtformat.;

Run;

Permanent vs Temporary SAS Datasets

When we work with SAS and look in the LOG window, we see references to something called WORK.TUESDAY or WORK.TUESDAY_NEW.  We didn’t add the WORK part, so where did that come from?

SAS organizes the data it reads in a Library.  The default library is called the WORK library.  This is temporary, which means that when I shut down SAS, all the datasets that were read into SAS are deleted.  Your original Excel files are still there, as is your SAS coding (if you saved it).  But any of the temporary SAS datasets are deleted.

We can create permanent SAS datasets, however.  These will be physical files with the file ending of .sas7bdat.  For extremely large files, this may be the best way to handle them.  Read them into SAS once and save them.

To do this we need to create a SAS library reference to a physical location on our laptop/computer.

libname sasdata "C:\Users\edwardsm\Documents\Workshops\SAS";

This maps the location to the SAS libraries in the “black box” of the SAS program.  To save a permanent SAS datafile to this location we do the following:

Data sasdata.tuesday_new;
  set tuesday_new;
Run;

We simply change the first name of WORK to our library name SASDATA.  Check out your log window to see what happened!  Also check your computer to see if you can find that file.
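
In a later SAS session, the permanent dataset can be used again by re-issuing the libname statement and referring to the dataset by its two-level name.  The sketch below assumes the same folder as above.  One caveat: if the saved dataset carries the user-defined formats, remember that those formats are stored in the temporary WORK library by default, so in a new session you would either re-run the Proc format step first or set the NOFMTERR option to see the unformatted values.

```sas
/* Point the libref at the folder that holds tuesday_new.sas7bdat */
libname sasdata "C:\Users\edwardsm\Documents\Workshops\SAS";

/* Only needed if the Proc format step has not been re-run in this session:
   show the raw values instead of stopping with a "format not found" error */
options nofmterr;

Proc print data=sasdata.tuesday_new;
Run;
```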

NB: I’m not sure how this works with SAS Studio!

Descriptive Statistics

We will run Proc freq and Proc means to describe the data we have just read.
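
As a minimal sketch of what these two procedures could look like for this dataset (the choice of variables and statistics is an assumption, not necessarily what was run in the workshop): Proc freq counts the observations in each category of the coded variables, and Proc means summarizes the numeric measurements.

```sas
/* Frequency tables for the categorical (coded) variables */
Proc freq data=tuesday_new;
  tables group trmt eye_colour;
Run;

/* N, mean, standard deviation, minimum and maximum for the numeric variables */
Proc means data=tuesday_new n mean std min max;
  var age height;
Run;
```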

Here is a link to the SAS_20170609_ME that was used in this workshop.

 

S17 SAS Workshop: Introduction to SAS at the University of Guelph – How do I get my data in?

PDF Copy of Online notes – 20170711

Available Versions of SAS

  • PC Standalone Version – PC-SAS
    • Available for Windows ONLY – if you’re using a Mac, you will need to have a VM to emulate Windows to run this version
    • Available through CCS Software Distribution Centre – $118.63 for a new license and $75/year renewal license. This information was downloaded on May 29, 2017, please check https://guelph.onthehub.com/WebStore/Welcome.aspx for updated pricing and access information or email 58888help@uoguelph.ca for more information
  • Animal Biosciences department ONLY
    • Access the server version of PC-SAS
  • SAS University Edition
    • This is free for all academics to use. You can download the free version from https://www.sas.com/en_ca/software/university-edition.html
    • This is available for both Mac and Windows users
    • Please note, that you will be required to update this version every year.  SAS will send you a reminder notice, approximately 1 year from your installation date.
  • SAS OnDemand
    • This is also free for academics
    • This is SAS’ in the cloud version of the University Edition
    • Environment is the same as the University Edition, the difference is that you are using the SAS service in the Cloud, all your files are stored in the Cloud and not on your local system, and you are using their computer resources NOT your own system – accessed through a web browser with your own personal login

What Parts of SAS do you have access to?

SAS is an extremely large and complex software program with many different components.  We primarily use Base SAS, SAS/STAT, SAS/ACCESS, and maybe bits and pieces of other components such as SAS/IML.

SAS University Edition and SAS OnDemand both use SAS Studio.  SAS Studio is an interface to the SAS program and contains the following components:

  • Base SAS – base SAS programming, DATA Step
  • SAS/STAT – the PROCs used for statistical analyses
  • SAS/IML – SAS’ matrix programming language
  • SAS/ACCESS – allows you to interact with different data formats
  • Some parts of SAS/ETS – time series analysis

If you are using the PC or Server SAS versions, you may have access to more than the modules listed above.  To see exactly what you have access to, you can run the following code:

Proc Setinit;
Run;

You will see the components available to you listed in the log window.

SAS_setinit_log_window_results.png

Also note the additional information available to you:

  • License information
  • Expiration date – very handy to be aware of, especially if you are running your own copy on your PC
  • SAS components available to you

What does SAS look like?

There are a number of components to the SAS interface:

  • Results and Explorer windows to the left
  • Editor, Log, Output, and Results Viewer windows to the right, taking up most of the screen

SAS_Windows_interface_Contents_window_Editor_Log_Windows.jpg

What does each of these windows do?

  • Results Window –  a Table of Contents for all of your results.
  • Explorer Window – similar to Windows Explorer – allows you to navigate SAS libraries and files
  • Editor Window – this is where you will spend most of your time, writing and editing program files
  • Log Window – this window is extremely helpful, think of it as your best friend in SAS, it tells you what SAS has done every step of your program and processing
  • Output Window – SAS versions 9.2 and earlier, use this window to display all results and output.  SAS 9.3 and higher use a new window called the Results Viewer.  All the results are presented in an HTML format.

How does SAS work?

SAS is divided into 2 areas:

  • DATA step
  • PROCs (short for PROCedures)

DATA step is all about data manipulation – one of the key strengths to SAS
PROCs – this is where you will find most of your statistical procedures.

How do you get data into SAS?

The primary reason we use SAS is to perform statistical analyses on some data.  However, we need to ensure that the data we have collected is brought into SAS correctly.  I’m sure you’ve heard of “garbage in, garbage out”?  This is never truer than when you collect data and bring it into a statistical package.

There are different ways to bring data into SAS.  I will try to review and provide my thoughts on 3 different ways I see my students performing this task.  However, before we import data into any software package, we need to ensure the data is “clean” and in a format that will be accepted into the package.  So let’s talk about the most common way researchers enter their data – EXCEL.

Using Excel to enter data and Statistical Software packages

Most people use Excel to enter their data and that’s great!  The look of it is neat, ordered and we can do quick summaries, such as means and sums.  We can also make Excel look pretty by adding colours, headings, footnotes, or maybe notes about what we did and how.  In the end, Excel can be a very versatile tool.  But, we need to keep in mind that Excel is NOT a statistical package and that we are using it to collect our data.  That being said, I recognize many people use it for more than it was set out to be.

Let’s take a look at an example of how Excel is used.

2017_Cookie_trial

Everyone uses Excel differently when entering data.  This file is a very simple example.  Many people will highlight cells or add comments, etc…  Every file will need to be “cleaned” before it can be used in SAS.  These are recommended steps to clean any Excel file.

Recommended steps to clean an Excel file:

  • Copy the entire sheet into a blank worksheet.  This allows you to keep the formatted version while working on a clean version.
  • Label the new worksheet SAS or something that makes sense to you.  This way when we import the data you will know which worksheet contains the clean data.
  • Remove all the formatting.  In Excel, Click on the CLEAR button and select Clear Formats.  This will remove all Excel formatting from the worksheet.
  • The top row of the Excel file needs to contain the name of the variables you wish to use in SAS.  Note that some files may have titles and/or long descriptions at the top of the worksheet.  These need to be deleted.
  • You will now need to modify the headings of the columns so that they are valid SAS variable names (see the SAS naming conventions under TIPS).
  • The variable names are ones that will have a significance to you.  Please DOCUMENT these changes so you know what is contained in your dataset!  I will provide more information on Variable Labels and Value Labels in a follow-up post
  • Don’t forget to save your Excel file!
  • If there are any notes at the bottom of your worksheet or anywhere else in the worksheet – you will need to delete these.

RECAP:

  1. Copy data into new worksheet
  2. Rename worksheet for easy identification later
  3. Clean variable names in the first row
  4. Second row contains your data and NOT blanks

TIPS:

SAS naming conventions:

  • variable names do not contain any “funny” characters such as *, %, $, etc…
  • variable names begin with a letter
  • variable names may contain a number, but cannot begin with a number
  • NO spaces allowed!  You may use _ in place of a space

IMPORTING EXCEL FILES INTO SAS – works best with individual PC-SAS license

Using the IMPORT feature in SAS is probably the easiest way of bringing data into the SAS program.  To import the data follow these steps:

  1. In the SAS program – File -> Import Data
  2. You will now answer each step of the Import Wizard.
  3. With SAS 9.2, you will need to save your Excel files as the older 97-2003 .xls format.  This version of SAS does NOT recognize the .xlsx ending for Excel
  4. Browse your computer to find and select your Excel file
  5. Select the worksheet in the file using the dropdown box in SAS.  This is why I suggested earlier to call it SAS or something you will remember
  6. This next step can be tricky.  Leave the Library setting to WORK.  In the Member box provide SAS with a name for your datafile to be saved in SAS.  For this example let’s call it COOKIE
  7. The next step is optional!  If you are planning on importing more files that have the same structure or where your answers to the Wizard will be the same, this step allows you to save the program (or syntax) that SAS creates to import the file.
  8. Finish and your file is now in SAS
  9. Check the Log Window
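
If you save the program in step 7, the Wizard writes a Proc import step for you.  The sketch below shows roughly what such a step looks like.  The file path is hypothetical, and the exact options the Wizard produces depend on your SAS version and on which SAS/ACCESS components you have licensed.

```sas
/* Hypothetical path: replace with the location of your own Excel file */
Proc import datafile="C:\mydata\2017_Cookie_trial.xls"
     out=work.cookie      /* Library WORK, Member COOKIE */
     dbms=xls             /* older 97-2003 .xls format */
     replace;             /* overwrite COOKIE if it already exists */
  sheet="SAS";            /* the clean worksheet created earlier */
  getnames=yes;           /* first row holds the variable names */
Run;
```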

 

COPY AND PASTE DATA INTO SAS

As much as I would like to discourage people from using this method of bringing data into SAS, it is a viable option: it works about 95% of the time.  There is, however, the odd case, about 5% of the time, where this method will fail.

Let’s work through how we enter data into Excel and translate our steps into SAS

The first thing most of us do when entering data into Excel is to create variable names or headings in the first row of Excel.  We then begin to type our data in the second row.  When we’ve completed entering the data or we have a page full of data, that’s when most of us remember to save the file.  Sound familiar?

In SAS we can do all of these steps using a DATA Step.  We will be creating a program or writing syntax in the SAS editor for this bit.  To start, SAS likes us to save our file FIRST, before we enter any data – contrary to what we traditionally do in Excel.  We start our program

Data cookie;

The first thing we did in Excel was label our columns – this is the second line of our SAS code:

Data cookie;
Input ID trmt wt28d ht28d wt56d ht56d cc28d cc56d wtgain adg adcc cookie_gain;

The next thing we do in Excel is start entering our data.  In SAS, we first let SAS know that the data is coming by adding a datalines; statement in our code and then we enter our data.

Data cookie;
Input ID trmt wt28d ht28d wt56d ht56d cc28d cc56d wtgain adg adcc cookie_gain;

Datalines;
13K Chocolate cookies 89 47 116 48 308 1232 27 0.48 22.0 45.6

In order to complete our data entry in SAS, we need to let SAS know that there are no more data points and to go ahead and save the file.  To do this we add a “;” all by itself at the end of the data and a Run; to let SAS finish the data and save the file.

Data cookie;
Input ID trmt wt28d ht28d wt56d ht56d cc28d cc56d wtgain adg adcc cookie_gain;
Datalines;
13K Chocolate cookies 89 47 116 48 308 1232 27 0.48 22.0 45.6
;
Run;

Rather than retyping all the data, we can copy it from Excel and paste it after the datalines statement.  As I noted above, this will work most of the time, but there are times where it does not work.  Why you may ask?  I suspect it is some hidden Excel formatting that plays havoc with SAS, but I cannot identify exactly what it is.  Just note that this method may fail at times.

When you first try running the above program there are a few errors that show up.  The first one we need to deal with is that SAS LOVES numbers and needs you to specify the variables that are not numeric in nature.  In our dataset we have 2 variables that are string, characters, or a bunch of letters.  These are ID and TRMT.  To inform SAS that these are not numbers we need to add a “$” after the variable name.  The $ can be attached to the variable name or there can be a space between it and the variable name – it doesn’t matter.

Data cookie;
Input ID$ trmt$ wt28d ht28d wt56d ht56d cc28d cc56d wtgain adg adcc cookie_gain;
Datalines;
13K Chocolate cookies 89 47 116 48 308 1232 27 0.48 22.0 45.6
;
Run;

When we run the SAS program there are still problems.  When we review the LOG window, we see that there are problems with wt28d.  Below is a small snippet from the LOG window:

1 Data cookie;
2 input ID$ trmt$ wt28d ht28d wt56d ht56d cc28d cc56d wtgain adg adcc cookie_gain;
3 datalines;

NOTE: Invalid data for wt28d in line 4 15-21.
RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+-
4 13K Chocolate cookies 89 47 116 48 308 1232 27 0.48 22.0 45.6
ID=13K trmt=Chocolat wt28d=. ht28d=89 wt56d=47 ht56d=116 cc28d=48 cc56d=308 wtgain=1232 adg=27
adcc=0.48 cookie_gain=22 _ERROR_=1 _N_=1
NOTE: Invalid data for wt28d in line 5 12-17.
5 14K Ginger treats 105 50 134 54 80 80 29 0.52 1.4 2.8
ID=14K trmt=Ginger wt28d=. ht28d=105 wt56d=50 ht56d=134 cc28d=54 cc56d=80 wtgain=80 adg=29
adcc=0.52 cookie_gain=1.4 _ERROR_=1 _N_=2

When you read this you’ll notice a couple of items:

  1. you can see the full line of data – since it is copied from the editor window.
  2. you notice when SAS reads the data – ID is correct, however, the value for TRMT appears to be truncated to “Chocolat” from “Chocolate cookies”

What’s happening is that the default value for the length of any string or character variables in SAS is only 8 characters long.  The length of our TRMT variable is 17 for Chocolate cookies and 13 for “Ginger treats”.  To overcome this challenge, we need to inform SAS at the beginning of our code that the length of TRMT is longer than the default 8 characters, that we want it to be 17 characters long.  This way it will accommodate our Chocolate cookies value along with the Ginger treats.

To do this we need to add a LENGTH statement before the INPUT statement in SAS.  As soon as SAS reads the INPUT statement, it creates all the variables with the 8 character length.  By adding the LENGTH statement first, SAS now sets up the variable with whatever length we specify.

Data cookie;
Length trmt $17;
Input ID$ trmt$ wt28d ht28d wt56d ht56d cc28d cc56d wtgain adg adcc cookie_gain;
Datalines;
13K Chocolate cookies 89 47 116 48 308 1232 27 0.48 22.0 45.6
;
Run;

As a sidenote, when we declare a variable’s length BEFORE the INPUT statement, SAS will put that variable first in the dataset.  In the INPUT statement we list ID TRMT WT28D HT28D WT56D HT56D CC28D CC56D WTGAIN ADG ADCC COOKIE_GAIN – in that order.  But with a LENGTH statement at the beginning of our program, the order changes to: TRMT ID WT28D HT28D WT56D HT56D CC28D CC56D WTGAIN ADG ADCC COOKIE_GAIN.
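
One more thing to watch for with this example: the LENGTH statement stops the truncation, but with the data lines typed as shown, SAS still treats the single blank inside Chocolate cookies as the end of the value, so cookies would be read into wt28d.  A possible workaround, sketched below as an assumption rather than the approach used in the workshop, is the & modifier.  It lets a character value contain single blanks, as long as the value is followed by at least two blanks in the data line.

```sas
Data cookie;
  Length trmt $17;
  /* & after trmt: keep reading the value until two or more consecutive blanks are found */
  Input ID$ trmt $ & wt28d ht28d wt56d ht56d cc28d cc56d wtgain adg adcc cookie_gain;
  Datalines;
13K Chocolate cookies  89 47 116 48 308 1232 27 0.48 22.0 45.6
14K Ginger treats  105 50 134 54 80 80 29 0.52 1.4 2.8
;
Run;
```

Note the two blanks after each treatment name in the data lines; without them, SAS would continue reading the numbers into TRMT.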

READING DATA FROM A FILE

In order to read data that has been saved to a file, the INFILE statement must be used before the INPUT statement.   Think of it in these terms.  You need to tell SAS where the data is first (INFILE) before you can tell it what is contained inside the file (INPUT).  Another trick to remembering the order, the two statements are to be used in alphabetical order.

NOTE: Before we can read in a datafile into SAS, we need to save it in the proper format from Excel.  On a WINDOWS laptop/computer, in Excel, please select File -> Save As -> Save as type should be CSV(Comma Delimited).  On a Mac, in Excel, please select File -> Save As -> Type should be MS-DOS Comma Delimited.

Once you have a datafile that has been created in Excel or another program, and if that file is a text file, which means a file that only has data and spaces, then the INFILE statement will only be used to tell SAS the location of the text file on the computer. Here is an example:

Data cookie;
Length trmt $17;
Infile "C:\Users\edwardsm\Documents\Workshops\SAS\July-2017\2017_Cookie_trial.csv";
Input ID$ trmt$ wt28d ht28d wt56d ht56d cc28d cc56d wtgain adg adcc cookie_gain;
Run;

A Comma Separated Values(CSV or Comma Delimited) File is one of the most common text files used for data today, probably more common than a text file.  If you use a text file, we assume that there are only empty spaces between the variable values.  With a CSV file there are commas (,) separating the values, so we need to tell SAS this.  This can be done by adding DLM (which is short for DELIMITER) = “,” at the end of the INFILE statement.

There is another aspect of our CSV files that we will need to tell SAS about.  When we are working in EXCEL and create our CSV files, we use the top row to list our variable names (to identify the variables).  This is fine, but again, we need to let SAS know that we don’t want it to start reading the data until the second row or whichever row your data starts in.  We do this by adding FIRSTOBS=2 at the end of the INFILE statement.  So we will have something that looks like this:

Data cookie;
Length trmt $17;
Infile "C:\Users\edwardsm\Documents\Workshops\SAS\July-2017\2017_Cookie_trial.csv" dlm="," firstobs=2;
Input ID$ trmt$ wt28d ht28d wt56d ht56d cc28d cc56d wtgain adg adcc cookie_gain;
Run;

With a CSV file, remember that we are using a “,” to separate the variable values or the columns that were in Excel.  What happens though, if we have commas in our data?  For example, instead of Chocolate cookies, we may have entered the data to show Cookies, chocolate.  If we leave the INFILE statement as it reads now, when SAS encounters one of those commas, it will move onto reading the next variable, which we know will fail or make a mess of our data.  To prevent this from happening we need to add the DSD option at the end of our INFILE statement.  With DSD, SAS treats a comma inside a quoted value (Excel adds the quotation marks when it saves such a value to a CSV file) as part of the data rather than as a delimiter.

And…  one last note about using the INFILE statement.  Quite often you will see one more option at the end of this statement, and one that I highly recommend:  MISSOVER.  When you use Excel to enter your research data, you will encounter times when you have no data.  Many people leave the cells blank.  When this happens at the end of a record or row in your datafile, SAS will see that blank and assume that the next variable value is on the next row, making a fine mess of reading in your data.  By adding the MISSOVER option at the end of the INFILE statement, you’re telling SAS that it’s fine that the cell is missing and to start the new row of the SAS dataset with the new row/record in Excel.

Data cookie;
Length trmt $17;
Infile "C:\Users\edwardsm\Documents\Workshops\SAS\July-2017\2017_Cookie_trial.csv" dlm="," firstobs=2 missover dsd;
Input ID$ trmt$ wt28d ht28d wt56d ht56d cc28d cc56d wtgain adg adcc cookie_gain;
Run;
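
To see what DSD and MISSOVER do without needing a file, the same options can be tried on a few in-line records by pointing INFILE at DATALINES.  This is only a sketch: the dataset name demo, the shortened list of variables, and the two records are made up for illustration.

```sas
Data demo;
  Length trmt $20;
  /* Same options as above, applied to the in-line records below */
  Infile datalines dlm="," dsd missover;
  Input ID$ trmt$ wt28d ht28d;
  Datalines;
13K,"Cookies, chocolate",89,47
14K,Ginger treats,105
;
Run;

Proc print data=demo;
Run;
```

In the printed result, the first record should keep Cookies, chocolate together as one TRMT value, and the second record should show a missing value (.) for ht28d rather than borrowing a value from another row.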

READING DATA FROM A FILE USING SAS STUDIO

When working with PC-SAS on your own system, you can identify where your files are by looking through Windows Explorer.  However, when you’re using SAS Studio, because it uses a Virtual Machine, and you run the SAS program through a web browser, finding the right place for your files can be challenging.  A few extra steps are needed to read your data from a file using SAS Studio.

    1. Upload files to SAS Studio.
      • This will place your files within the SAS Studio environment, so that it can see them.
      • To do this – right click on Files(Home)
      • Select Upload Files
      • Select your file and upload

SAS_Studio_Server_Files_Interface.png

2.  Your INFILE statement will need to know where your files are located. They are not on C:\….  they are within the SAS Studio environment.  To find them, right-click on the file and select Properties

SAS_Studio_Server_Files_Properties_Interface.png

SAS_Studio_File_Propertes.png

3.  Copy the information in the Location section to use for your INFILE statement.

Data cookie;
Length trmt $17;
Infile "/home/edwardsm0/2017_Cookie_trial.csv" dlm="," firstobs=2;
Input ID$ trmt$ wt28d ht28d wt56d ht56d cc28d cc56d wtgain adg adcc cookie_gain;

Run;

 

Viewing the data in SAS

We’ve just imported our data and I see nothing!  What happened?  Did I do something wrong?  My log window says my data has been successfully imported, but where did the data go?

Once you’ve imported your data, SAS saved it in a dataset within its program. So think of it as a black box, and somewhere in that black box is a dataset called COOKIE.  How do you go about viewing it?  Let’s use a PROCedure called PRINT.

PROC PRINT will show you your data in the Output window.

Proc print data=cookie;
Run;

These statements will printout ALL the observations in your dataset.  Note when we say “printout” it prints to the screen and not to your printer.  Please note that specifying the dataset you are working with is an EXCELLENT habit to get into.  In this case we are interested in viewing the data contained in the COOKIE dataset – data=cookie

To view only the first 5 observations in this dataset we can add an option at the end of the Proc print statement.

Proc print data=cookie (obs=5);
Run;

Maybe we want to view observations 6-8.  We can add a second option at the end of our Proc print statement:

Proc print data=cookie (firstobs=6 obs=8);
Run;

This tells SAS that the first observation we want to view is the 6th and the last is the 8th.  Note that OBS= gives the number of the last observation to process, not a count of observations.

We can also tell SAS that rather than looking at all the variables we only want to see TRMT by adding a var statement to our Proc print.

Proc print data=cookie (obs=5);
  var TRMT;
Run;

SAS Programming/Coding TIP

1. Add comments to everything you do in SAS.  Use the *  ;  or /*  */  For example:

/*  To test whether I read my data correctly I will use the Proc Print to view the first 5 observations  */

Proc print data=cookie (obs=5);
Run;

2.  ALWAYS specify the data that you are using with your PROCedure.

3.  ALWAYS add a RUN; statement at the end of your DATA step and at the end of each PROCedure.  Makes your code cleaner and allows you to select portions of your code to run.

4.  Indenting the lines of code between the PROCedure name and the RUN; statement makes it easier to read your coding.

5.  SAS is NOT case sensitive with respect to your code, however, it is with your data.

6.  The more you code in SAS, the more apt you are to develop your own coding style.
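
Tip 5 can be illustrated with a short sketch, assuming the COOKIE dataset read in above contains the value Chocolate cookies for TRMT.  The keywords can be typed in any case, but a quoted data value must match the case stored in the dataset.  The WHERE statement and the UPCASE function are not covered in these notes; they are used here only for illustration.

```sas
/* Code is not case sensitive: PROC, proc and Proc all work */
PROC PRINT DATA=cookie;
  where trmt = "Chocolate cookies";    /* matches the stored value */
Run;

/* Data ARE case sensitive: this comparison selects no observations */
proc print data=cookie;
  where trmt = "chocolate cookies";
run;

/* Converting the stored values to uppercase makes the comparison ignore case */
proc print data=cookie;
  where upcase(trmt) = "CHOCOLATE COOKIES";
run;
```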

S17 RDM Workshop: Data Preservation: The Legacy of your Research Data

Powerpoint Presentation used during Workshop – June 14, 2017

This workshop was a continuation of the 1st RDM workshop where we concentrated on creating new datasets with variable names that matched Best Practices for reading your data in a Statistical analysis package.  This workshop looks at how you handle data that has been passed on to you from another individual, and all the challenges that accompany “inheriting” or acquiring data.

There were 2 exercises in this workshop.  The first exercise, you were provided a small dataset and a number of variables and titles for each.  You were asked to determine what types of questions you would need answered before you were able to work with this dataset effectively.  Below are links to the 4 datasets along with a sample of questions that you may need to ask.

Exercise1_Group1

Exercise1_Group2

Exercise1_Group3

Exercise1_Group4

The second exercise that was conducted in this workshop, was one where you were provided with one of two directories that held a number of files.  You were asked to review the directory structure and the file naming conventions, and provide a new one that was consistent with the recommended Best Practices presented in the workshop.  Below are links to proposed answer keys to both directories.

Group1_folder_directory_answersheet

Group2_folder_directory_answersheet

To reiterate, the primary goal of these workshops is that you take all the recommended Best Practices presented and implement them with your own data.

ARCHIVE: S17 RDM Workshop: Best Practices for entering your Research Data using Excel

Powerpoint Used during the 20170607 workshop

Commonly Used Statistical Packages

For the purposes of this workshop, the following statistical packages were reviewed:

  • SAS
  • SPSS
  • Stata
  • R
  • Matlab

It is recognized that there are many more available and used in the OAC community.  If you have questions regarding other packages not included here, please email oacstats@uoguelph.ca .

Commonly Used Statistical Packages:  Variable name restrictions and limits

LENGTH OF THE VARIABLE NAME

SAS – 32 characters long
SPSS – 64 bytes long
• 64 characters in English
• 32 characters in Chinese
Stata – 32 characters long
R – 10,000 characters long
Matlab – 63 characters long

1ST CHARACTER OF THE VARIABLE NAME

SAS – 1st character MUST be:
• a letter (English) OR
• an underscore “_”
SPSS – 1st character MUST be:
• a letter (English) OR
• an underscore “_” OR
• “@”,“#”,“$”
Stata – 1st character MUST be:
• a letter (English) OR
• an underscore “_”
R – NA
Matlab – 1st character MUST be:
• a letter

BLANKS IN VARIABLE NAMES

SAS – NO Blanks!
SPSS – NO Blanks!
Stata – NO Blanks!
R – NO Blanks!
Matlab – NO Blanks!

SPECIAL CHARACTERS IN VARIABLE NAMES

SAS – NO Special characters with the exception of:
• “_”
SPSS – NO Special characters with the exception of:
• “_”
• “.”
• “@”
Stata – NO Special characters with the exception of:
• “_”
R – NO Special characters with the exception of:
• “_”
• “.”
Matlab – NA

CASE IN VARIABLE NAMES

SAS – Mixed case – for presentation only
SPSS – Mixed case – for presentation only
Stata – Case sensitive
R – Case sensitive
Matlab – Case sensitive

NAMES/WORDS TO AVOID IN VARIABLE NAMES

SAS – SAS keywords
SPSS – SPSS Reserve words
Stata – NA
R – R function words
Matlab – Function names

GENERAL NOTES ABOUT VARIABLE NAMES

SAS – Libref names can only be 8 characters long
SPSS – #variable – is a scratch variable used in syntax
• $variable – is a system variable
• Do NOT end variable name with a “.” OR “_”
Stata – NA
R – R Community recommends that you develop a naming convention for your data
• Use of “_” is faster (10-20%) than the use of “.”
Matlab – NA

Commonalities across the Statistical Packages – Recommended Best Practices for Excel – Variable Names

LENGTH RECOMMENDATION

  • Maximum length: 32 characters
  • Keep the variable names short and use a variable label to provide more information. Remember you need to type these variable names in and you will need to remember them.

1ST CHARACTER OF A VARIABLE NAME

  • ALWAYS start variable names with a letter

VARIABLE NAMES AND SPECIAL CHARACTERS

  • Numbers may be used anywhere in the variable name AFTER the first character
  • Only use underscores “_”
  • Do NOT use BLANKS – replace blanks with an underscore “_”

CASE

  • Use lowercase
  • Case doesn’t matter for SAS and SPSS.
  • If you are using Stata, R, or MatLab – please be aware that the variable names are case sensitive – if you use lowercase as a Best Practice you won’t forget which ones are Capitals and which ones are NoT.

FAMILIARITY WITH STATISTICAL PACKAGE NOMENCLATURE

  • As you work with a particular package you will become familiar with keywords or reference words that are reserved for the program to use.
  • As a general rule keep away from Statistical terms as variable names.  If you REALLY want to use “mean” qualify it with your data, so wt_mean or concentration_mean

Commonalities across the Statistical Packages – Recommended Best Practices for Excel – Variable Labels

Variable names are often short and may not reflect the contents of the data collected.  Trying to create a variable name that is a descriptive summary of the data can be extremely challenging.  Recommendation is to create short, concise variable names and to create variable labels that are descriptive for each variable name.

Variable name:  wt28

Variable label: Weight (kg) at 28 days of age 

CREATING VARIABLE LABELS – SAS

Data first;
  Infile …
  Input …

 
  Label wt28 = "Weight (kg) at 28 days of age";
Run;

 CREATING VARIABLE LABELS – SPSS

  1. Variable View in the SPSS Data viewer
    • Find the variable called wt28
    • In the Column called Label – Type: Weight (kg) at 28 days of age
  2. Syntax Window:

VARIABLE LABELS

Wt28 "Weight (kg) at 28 days of age".

CREATING VARIABLE LABELS – R

Apply the appropriate function for the space you are working in.  For instance the Dataframe, Vector, etc..

Lapply function

CREATING VARIABLE LABELS – STATA

label variable wt28 "Weight (kg) at 28 days of age"

 

CREATING VARIABLE LABELS – MATLAB

T.Properties.VariableDescriptions{'wt28'} = 'Weight (kg) at 28 days of age';