ARCHIVE: W18 SAS Workshop: Merging datasets and creating new variables

PDF copy of the SAS code used in the workshop

This workshop will walk through merging datasets and creating new variables in SAS.  You can choose to do all these steps in Excel and many people do.  But sometimes it can be easier to do it all in SAS.  Use Excel to enter your data, but then let SAS do the data manipulations and analysis.  One of SAS’ strengths is in data manipulation!

Let’s start by bringing the data into SAS.  Here is a link to an Excel file called Wksp2_data.  Download the file and save it on your computer.  There are 4 worksheets in this file.  I will leave it up to you as to how you bring the data into SAS, but you will need to bring all 4 sheets in as 4 different SAS datasets.  Name each dataset after its worksheet: January for the January worksheet, February for the February worksheet, and so on.

End goal:  4 SAS datasets:  January, February, March, April

Review your LOG window after you bring each one in to make sure you haven’t missed anything.  Then run a Proc Print to ensure the data looks like you were expecting it – in other words, it should look like the Excel worksheets.
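One possible way to bring in a worksheet is Proc Import.  The sketch below is a minimal example, assuming the file was saved as an .xlsx file at the placeholder path shown, that the worksheet is named January, and that your SAS installation can read .xlsx files.  Repeat it for the other 3 worksheets, changing the sheet= and out= names.

```sas
/* Import the January worksheet.                          */
/* The path is a placeholder - replace it with your own.  */
proc import datafile="C:\path\to\Wksp2_data.xlsx"
    out=January
    dbms=xlsx
    replace;
  sheet="January";   /* assumed worksheet name             */
  getnames=yes;      /* first row holds the variable names */
run;

/* Check that the data look like the Excel worksheet */
proc print data=January;
run;
```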

Best Practice Note:  Add a comment line before each Proc to describe what you’re doing!  This way you’ll remember when you go back to review your syntax.

Merging Datasets

When you think about merging datasets, there are 2 ways that you might want to merge: across or down.

Across or adding Variables

We have 3 months of data where the IDs are the same: January, February, and March.  We want to take the weights from February and March and add them to the end of the January file.  So we’re adding variables in this case.

To accomplish this we need to take 2 steps.  The first is to sort each dataset to make sure they are in the same order.

Proc sort data=January;
  by ID;
Run;

Proc sort data=February;
  by ID;
Run;

Proc sort data=March;
  by ID;
Run;

Step Two:

Create a new Dataset for the merged data.  Remember that the Data statement saves the data using the name you give it – so let’s call it jan_mar – for January to March.

Data jan_mar;

We’re then going to tell SAS that we want to merge the 3 datasets and we want to merge them by ID.

Data jan_mar;
  merge january february march;
  by ID;
Run;

What does your LOG say?  Is it right?  What is the best way to make sure SAS has done what you wanted it to do?

What happened?    How should we fix this?

We had 3 files that had the same variable names in each.  So merging them by adding variables didn’t really work: we were not adding new variables, we were replacing the contents of the variables month and weight.  To fix this, we need to call the variables something different in each of the months.  I will add _jan, _feb, _mar, and _apr to the weight variable in each dataset.  I will also rename the variable month to month_1, month_2, etc…

Make these changes and rerun.  Did it work this time?
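One way to make these changes without editing the Excel file is the RENAME= dataset option on the MERGE statement.  This is only a sketch, assuming the three datasets are already sorted by ID and that the variables are named month and weight in each of them:

```sas
/* Merge across: rename month and weight as each dataset is read */
/* so that each month keeps its own columns                      */
data jan_mar;
  merge january  (rename=(weight=weight_jan month=month_1))
        february (rename=(weight=weight_feb month=month_2))
        march    (rename=(weight=weight_mar month=month_3));
  by ID;
run;

/* Check the result */
proc print data=jan_mar;
run;
```

The Proc Print should now show one weight column per month, with a missing value wherever an ID has no record in that month.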

You should now see where there are missing data too, something that was not apparent the first time we ran this.

Adding Observations or Down

We have a 4th dataset that contains weight measurements taken in January and April, but you’ll notice that these belong to individuals who were not included in the first 3 data files of the trial.  So we will need to add these to the bottom of the dataset currently called jan_mar.

We already have a file called April that contains the April data and now we have one called jan_mar that contains the merged data from January to March.  Since these individuals were not included in the original data, we do not need to sort them, since ID plays no role here.

To add the data to our merged dataset we use the SET command rather than the MERGE command:

Data jan_apr;
  set jan_mar april;
Run;

Proc print data=jan_apr;
Run;

Creating New Variables

We now have all the weight data for the individuals in our trial and would like to calculate the individual weight gains from January to March.  As noted before we can calculate these in Excel, but let’s use SAS to do it, especially since we have a new merged dataset.

In order to calculate the weight gain, we will be touching the data, and whenever we touch the data, we need to work within a DATA step.  So let’s create a new dataset and call it jan_apr_wtgain.

 

Since we will be using a dataset that is already available to us in SAS we use the SET command again to recall the jan_apr dataset.

Data jan_apr_wtgain;
  set jan_apr;

Now we can create our new variable.  Let’s call it wtgain and it will be the difference between the weight taken in January (weight_jan) and the weight taken in March (weight_mar).

Data jan_apr_wtgain;
  set jan_apr;

  wtgain = weight_mar - weight_jan;
Run;

Run a Proc Print to check your data.  Did it work?
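For example, printing only the variables involved makes the check easier.  This sketch assumes the identifier variable is called ID, as in the sort steps earlier:

```sas
/* Show only the two weights and the calculated gain */
proc print data=jan_apr_wtgain;
  var ID weight_jan weight_mar wtgain;
run;
```

Where either weight is missing, wtgain will be missing as well, and the LOG will note that missing values were generated.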

Recoding a variable

Sometimes we have a variable that we want to recode – so in our case we are going to create a new variable called wtclass that will take the weights measured in January and put them into 3 weight classes:  1 = 13-16; 2 = 17-20;  3 = 21-24

There are a number of different ways to accomplish this.  I will post one here and, depending on time in the workshop, I will show you others.  I will also include the other options in the accompanying SAS syntax – to be posted after the workshop.

We are working with data again, so need to work within a DATA step.  Let’s use the jan_apr_wtgain dataset.

Data jan_apr_wtgain;
  set jan_apr;

  wtgain = weight_mar - weight_jan;

  if weight_jan < 17 then wtclass = 1;
  if weight_jan ge 17 and weight_jan < 21 then wtclass = 2;
  if weight_jan >20 then wtclass = 3;
Run;

Run a Proc Print and see what happened.  Are we happy with these results?
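If the results are not what you expected, check two things.  SAS treats a missing value as smaller than any number, so an individual with no January weight falls into wtclass 1.  Also, a weight such as 20.5 satisfies both the second and the third IF statement, so it ends up in class 3.  One possible alternative, offered as a sketch and not necessarily the version shown in the workshop, uses IF / ELSE IF so that each observation is tested in order and lands in only one class.  It assumes that weights falling between the stated classes (for example 16.5 or 20.5) belong to the lower class:

```sas
data jan_apr_wtgain;
  set jan_apr;

  wtgain = weight_mar - weight_jan;

  /* Handle missing weights first, then test the classes in order */
  if missing(weight_jan) then wtclass = .;
  else if weight_jan < 17 then wtclass = 1;
  else if weight_jan < 21 then wtclass = 2;   /* assumption: 20.5 counts as class 2 */
  else wtclass = 3;
run;

proc print data=jan_apr_wtgain;
  var ID weight_jan wtclass;
run;
```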

What can SAS do?

There are so many more manipulations that SAS can do.  These are just a couple of them, and ones that may help you out as you start using SAS for your own research data.  Check out others that have been discussed in the past on the SASsyFridays blog.

ARCHIVE: SAS Workshop: Introduction to the Program

PDF version of Notes

Available Versions of SAS

  • PC Standalone Version – PC-SAS
    • Available for Windows ONLY – if you’re using a Mac, you will need to have a VM to emulate Windows to run this version
    • Available through CCS Software Distribution Centre – $135.00 for a new license and $75/year renewal license. This information was downloaded on June 3, 2019, please check https://guelph.onthehub.com/WebStore/Welcome.aspx for updated pricing and access information or email 58888help@uoguelph.ca for more information
  • Animal Biosciences department ONLY
    • Access the server version of PC-SAS
  • SAS University Edition
    • This is free for all academics to use. You can download the free version from https://www.sas.com/en_ca/software/university-edition.html
    • This is available for both Mac and Windows users
    • Please note, that you will be required to update this version every year.  SAS will send you a reminder notice, approximately 1 year from your installation date.
  • SAS OnDemand
    • This is also free for academics
    • This is SAS’ in the cloud version of the University Edition
    • Environment is the same as the University Edition, the difference is that you are using the SAS service in the Cloud, all your files are stored in the Cloud and not on your local system, and you are using their computer resources NOT your own system – accessed through a web browser with your own personal login

What Parts of SAS do you have access to?

SAS is an extremely large and complex software program with many different components.  We primarily use Base SAS, SAS/STAT, SAS/ACCESS, and maybe bits and pieces of other components such as SAS/IML.

SAS University Edition and SAS OnDemand both use SAS Studio.  SAS Studio is an interface to the SAS program and contains the following components:

  • Base SAS – base SAS programming, DATA Step
  • SAS/STAT – the PROCs used for statistical analyses
  • SAS/IML – SAS’ matrix programming language
  • SAS/ACCESS – allows you to interact with different data formats
  • Some parts of SAS/ETS – time series analysis

If you are using the PC or Server SAS versions, you may have access to more than the modules listed above.  To see exactly what you have access to, you can run the following code:

Proc Setinit;
Run;

You will see the components available to you listed in the log window.

SAS_setinit_log_window_results.png

Also note the additional information available to you:

  • License information
  • Expiration date – very handy to be aware of, especially if you are running your own copy on your PC
  • SAS components available to you

What does SAS look like?

There are a number of components to the SAS interface:

  • Results and Explorer windows to the left
  • Editor, Log, Output, and Results Viewer windows to the right, taking up most of the screen

SAS_Windows_interface_Contents_window_Editor_Log_Windows.jpg

What do each of these windows do?

  • Results Window –  a Table of Contents for all of your results.
  • Explorer Window – similar to Windows Explorer – allows you to navigate SAS libraries and files
  • Editor Window – this is where you will spend most of your time, writing and editing program files
  • Log Window – this window is extremely helpful, think of it as your best friend in SAS, it tells you what SAS has done every step of your program and processing
  • Output Window – SAS versions 9.2 and earlier use this window to display all results and output.  SAS 9.3 and higher use a new window called the Results Viewer, where all the results are presented in an HTML format.

How does SAS work?

SAS is divided into 2 areas:

  • DATA step
  • PROCs (short for PROCedures)

DATA step is all about data manipulation – one of the key strengths to SAS
PROCs – this is where you will find most of your statistical procedures.

How do you get data into SAS?

The primary reason we use SAS is to perform statistical analyses on some data.  However, we need to ensure that the data we have collected is brought into SAS correctly.  I’m sure you’ve heard of “garbage in, garbage out”?  This could not be truer than when you collect data and bring it into a statistical package.

There are different ways to bring data into SAS.  I will try to review and provide my thoughts on 3 different ways I see my students performing this task.  However, before we import data into any software package, we need to ensure the data is “clean” and in a format that will be accepted into the package.  So let’s talk about the most common way researchers enter their data – EXCEL.

Using Excel to enter data and Statistical Software packages

Most people use Excel to enter their data and that’s great!  The look of it is neat, ordered and we can do quick summaries, such as means and sums.  We can also make Excel look pretty by adding colours, headings, footnotes, or maybe notes about what we did and how.  In the end, Excel can be a very versatile tool.  But, we need to keep in mind that Excel is NOT a statistical package and that we are using it to collect our data.  That being said, I recognize many people use it for more than it was set out to be.

Let’s take a look at an example of how Excel is used.

2018 Trial Data

Everyone uses Excel differently when entering data.  This file is a very simple example.  Many people will highlight cells or add comments, etc…  Every file will need to be “cleaned” before it can be used in SAS.  These are recommended steps to clean any Excel file.

Recommended steps to clean an Excel file:

  • Copy the entire sheet into a blank worksheet.  This allows you to keep the formatted version while working on a clean version.
  • Label the new worksheet SAS or something that makes sense to you.  This way when we import the data you will know which worksheet contains the clean data.
  • Remove all the formatting.  In Excel, Click on the CLEAR button and select Clear Formats.  This will remove all Excel formatting from the worksheet.
  • The top row of the Excel file needs to contain the name of the variables you wish to use in SAS.  Note that some files may have titles and/or long descriptions at the top of the worksheet.  These need to be deleted.
  • You may need to modify the headings of the columns so that they make valid SAS variable names.
  • The variable names are ones that will have a significance to you.  Please DOCUMENT these changes so you know what is contained in your dataset!  I will provide more information on Variable Labels and Value Labels in a follow-up post
  • Don’t forget to save your Excel file!
  • If there are any notes at the bottom of your worksheet or anywhere else in the worksheet – you will need to delete these.

RECAP:

  1. Copy data into new worksheet
  2. Rename worksheet for easy identification later
  3. Clean variable names in the first row
  4. Second row contains your data and NOT blanks

TIPS:

SAS naming conventions:

  • variable names do not contain any “funny” characters such as *, %, $, etc…
  • variable names begin with a letter
  • variable names may contain a number, but cannot begin with a number
  • NO spaces allowed!  You may use _ in place of a space
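To make these rules concrete, here is a small sketch.  The variable names are made up for illustration and do not come from the workshop data:

```sas
/* Hypothetical variable names, for illustration only */
data names_demo;
  weight_jan = 14;    /* valid: letters and an underscore            */
  plot2      = 3;     /* valid: starts with a letter, ends in a number */
  /* 2plot = 3;          not valid: begins with a number             */
  /* weight jan = 14;    not valid: contains a space                 */
run;

proc print data=names_demo;
run;
```

Removing the comment marks from either of the last two lines will produce errors in the LOG.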

IMPORTING EXCEL FILES INTO SAS – works best with individual PC-SAS license

Using the IMPORT feature in SAS is probably the easiest way of bringing data into the SAS program.  To import the data follow these steps:

  1. In the SAS program – File -> Import Data
  2. You will now answer each step of the Import Wizard.
  3. With SAS 9.2, you will need to save your Excel files as the older 97-2003 .xls format.  This version of SAS does NOT recognize the .xlsx ending for Excel
  4. Browse your computer to find and select your Excel file
  5. Select the worksheet in the file using the dropdown box in SAS.  This is why I suggested earlier to call it SAS or something you will remember
  6. This next step can be tricky.  Leave the Library setting to WORK.  In the Member box provide SAS with a name for your datafile to be saved in SAS.  For this example let’s call it TRIAL
  7. The next step is optional!  If you are planning on importing more files that have the same structure or where your answers to the Wizard will be the same, this step allows you to save the program (or syntax) that SAS creates to import the file.
  8. Finish and your file is now in SAS
  9. Check the Log Window
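If you save the program in step 7, it will be a Proc Import.  The exact code the Wizard writes depends on your SAS version, so treat the following as a rough sketch only.  The file path is a placeholder, and the worksheet name SAS and dataset name TRIAL follow the suggestions above:

```sas
/* Roughly what the Import Wizard does for an .xls file         */
/* The path is a placeholder - replace it with your own file    */
proc import datafile="C:\path\to\trial_data.xls"
    out=work.trial
    dbms=xls
    replace;
  sheet="SAS";      /* the cleaned worksheet                     */
  getnames=yes;     /* first row holds the variable names        */
run;
```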

 

COPY AND PASTE DATA INTO SAS

As much as I would like to discourage people from using this method of bringing data into SAS, it is a viable option: it works about 95% of the time.  There is, however, the odd case, about 5% of the time, where this method will fail.

Let’s work through how we enter data into Excel and translate our steps into SAS

The first thing most of us do when entering data into Excel is to create variable names or headings in the first row of Excel.  We then begin to type our data in the second row.  When we’ve completed entering the data or we have a page full of data, that’s when most of us remember to save the file.  Sound familiar?

In SAS we can do all of these steps using a DATA Step.  We will be creating a program or writing syntax in the SAS editor for this bit.  To start, SAS likes us to save our file FIRST, before we enter any data – contrary to what we traditionally do in Excel.  We start our program.

Data trial;

The first thing we did in Excel was label our columns – this is the second line of our SAS code:

Data trial;
Input ID trmt weight height;

The next thing we do in Excel is start entering our data.  In SAS, we first let SAS know that the data is coming by adding a datalines; statement in our code and then we enter our data.

Data trial;
Input ID trmt weight height;

Datalines;
13K Pasture 89 47

In order to complete our data entry in SAS, we need to let SAS know that there are no more data points and to go ahead and save the file.  To do this we add a “;” all by itself at the end of the data and a Run; to let SAS finish the data and save the file.

Data trial;
Input ID trmt weight height;
Datalines;
13K Pasture 89 47
;
Run;

When you run this code – you will receive a number of invalid data messages in the LOG.  SAS loves numbers and the data we are trying to read in has letters and words in it.  We need to let SAS know that the variables we’ve called ID and trmt are not numbers.  We do this by adding a $ after the variable names.

Data trial;
Input ID$ trmt$ weight height;
Datalines;
13K Pasture 89 47
;
Run;

Now that should work without any errors in the LOG window.
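The same pattern extends to several rows.  In the sketch below only the first data line comes from the example above; the other two rows are made up for illustration, and the period shows how a missing numeric value can be entered:

```sas
data trial;
  input ID $ trmt $ weight height;
  datalines;
13K Pasture 89 47
14K Pasture 92 .
15K Barn 85 45
;
run;

/* Check that 3 observations and 4 variables were read */
proc print data=trial;
run;
```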

Rather than retyping all the data, we can copy it from Excel and paste it after the datalines statement.  As I noted above, this will work most of the time, but there are times where it does not work.  Why you may ask?  I suspect it is some hidden Excel formatting that plays havoc with SAS, but I cannot identify exactly what it is.  Just note that this method may fail at times.

READING DATA FROM A FILE

In order to read data that has been saved to a file, the INFILE statement must be used before the INPUT statement.   Think of it in these terms.  You need to tell SAS where the data is first (INFILE) before you can tell it what is contained inside the file (INPUT).  Another trick for remembering the order: the two statements are used in alphabetical order.

NOTE: Before we can read in a datafile into SAS, we need to save it in the proper format from Excel.  On a WINDOWS laptop/computer, in Excel, please select File -> Save As -> Save as type should be CSV(Comma Delimited).  On a Mac, in Excel, please select File -> Save As -> Type should be MS-DOS Comma Delimited.

Once you have a datafile that has been created in Excel or another program, and if that file is a text file, which means a file that only has data and spaces, then the INFILE statement will only be used to tell SAS the location of the text file on the computer.  Here is an example:

Data trial;
Infile "C:\Users\edwardsm\Documents\Workshops\SAS\trial.csv";
Input ID$ trmt$ weight height;
Run;

A Comma Separated Values (CSV or Comma Delimited) file is one of the most common text files used for data today, probably more common than a space-delimited text file.  If you use a plain text file, we assume that there are only empty spaces between the variable values.  With a CSV file there are commas (,) separating the values, so we need to tell SAS this.  This can be done by adding DLM (which is short for DELIMITER) = "," at the end of the INFILE statement.

There is another aspect of our CSV files that we will need to tell SAS about.  When we are working in EXCEL and create our CSV files, we use the top row to list our variable names (to identify the variables).  This is fine, but again, we need to let SAS know that we don’t want it to start reading the data until the second row or whichever row your data starts in.  We do this by adding FIRSTOBS=2 at the end of the INFILE statement.  So we will have something that looks like this:

Data trial;
Infile "C:\Users\edwardsm\Documents\Workshops\SAS\trial.csv" dlm="," firstobs=2;
Input ID$ trmt$ weight height;
Run;

With a CSV file, remember that we are using a “,” to separate the variable values or the columns that were in Excel.  What happens though, if we have commas in our data?  For example, instead of Chocolate cookies, we may have entered the data to show Cookies, chocolate.  If we leave the INFILE statement as it reads now, when SAS encounters one of those commas, it will move onto reading the next variable, which we know will fail or make a mess of our data.  To prevent this from happening we need to add the DSD option at the end of our INFILE statement.  Excel wraps a value that contains a comma in quotation marks when it saves a CSV file, and DSD tells SAS to treat a comma inside quotation marks as part of the value rather than as a delimiter.
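Here is a small sketch of DSD in action.  It reads comma-delimited lines typed after a datalines statement instead of a file, and the dataset name, variable names, and values are all made up for illustration.  With DSD, the comma is the default delimiter, quotation marks around a value are removed, and two commas in a row are read as a missing value:

```sas
data snacks;
  infile datalines dsd;
  length ID $ 8 item $ 20;     /* make room for the longer text value */
  input ID $ item $ weight height;
  datalines;
A1,"Cookies, chocolate",12,3
A2,Crackers,,4
;
run;

/* item for A1 keeps its comma; weight for A2 is missing */
proc print data=snacks;
run;
```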

And…  One last note about using the INFILE statement.  Quite often you will see one more option at the end of this statement, and one that I highly recommend:  MISSOVER.  Quite often when you use Excel to enter your research data, you will encounter times when you have no data.  Many people leave the cells blank.  When this happens at the end of a record or row in your datafile, SAS will see that blank and assume that the next variable value is on the next row, making a fine mess of reading in your data.  By adding the MISSOVER option at the end of the INFILE statement, you’re telling SAS that it’s fine that the cell is missing and to start the new row of the SAS dataset with the new row/record in Excel.

Data trial;
Infile "C:\Users\edwardsm\Documents\Workshops\SAS\trial.csv" dlm="," firstobs=2 missover dsd;
Input ID$ trmt$ weight height;
Run;

READING DATA FROM A FILE USING SAS STUDIO

When working in PC-SAS on your own system, you can identify where your files are by looking through Windows Explorer.  However, when you’re using SAS Studio, because it uses a Virtual Machine and you run the SAS program through a web browser, finding the right place for your files can be challenging.  There are a few extra steps to reading your data from a file using SAS Studio.

    1. Upload files to SAS Studio.
      • This will place your files within the SAS Studio environment, so that it can see them.
      • To do this – right click on Files(Home)
      • Select Upload Files
      • Select your file and upload

SAS_Studio_Server_Files_Interface.png

2.  Your INFILE statement will need to know where your files are located. They are not on C:\….  they are within the SAS Studio environment.  To find them, right-click on the file and select Properties

SAS_Studio_Server_Files_Properties_Interface.png

SAS_Studio_File_Propertes.png

3.  Copy the information in the Location section to use for your INFILE statement.

Data trial;
Infile "/home/edwardsm0/trial.csv" dlm="," firstobs=2;
Input ID$ trmt$ weight height;
Run;

Viewing the data in SAS

We’ve just imported our data and I see nothing!  What happened?  Did I do something wrong?  My log window says my data has been successfully imported, but where did the data go?

Once you’ve imported your data, SAS saves it in a dataset within its program.  So think of it as a black box, and somewhere in that black box is a dataset called TRIAL.  How do you go about viewing it?  Let’s use a PROCedure called PRINT.

PROC PRINT will show you your data in the Output window.

Proc print data=trial;
Run;

These statements will print out ALL the observations in your dataset.  Note when we say “printout” it prints to the screen and not to your printer.  Please note that specifying the dataset you are working with is an EXCELLENT habit to get into.  In this case we are interested in viewing the data contained in the TRIAL dataset – data=trial

To view only the first 5 observations in this dataset we can add an option at the end of the Proc print statement.

Proc print data=trial (obs=5);
Run;

Maybe we want to view observations 6-8.  We can add a second option at the end of our Proc print statement:

Proc print data=trial (firstobs=6 obs=8);
Run;

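This tells SAS that the first observation we want to view is the 6th (firstobs=6) and the last is the 8th (obs=8).  Note that obs= gives the number of the last observation to read, not a count of observations, so 3 observations are printed.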

We can also tell SAS that rather than looking at all the variables we only want to see TRMT by adding a var statement to our Proc print.

Proc print data=trial (obs=5);
  var TRMT;
Run;

SAS Programming/Coding TIP

1. Add comments to everything you do in SAS.  Use the *  ;  or /*  */  For example:

/*  To test whether I read my data correctly I will use the Proc Print to view the first 5 observations  */

Proc print data=trial (obs=5);
Run;

2.  ALWAYS specify the data that you are using with your PROCedure.

3.  ALWAYS add a RUN; statement at the end of your DATA step and at the end of each PROCedure.  Makes your code cleaner and allows you to select portions of your code to run.

4.  Indenting the lines of code between the PROCedure name and the RUN; statement makes it easier to read your coding.

5.  SAS is NOT case sensitive with respect to your code, however, it is with your data.

6.  The more you code in SAS, the more apt you are to develop your own coding style.
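The tips above can be seen together in a short sketch.  It assumes the TRIAL dataset created earlier, with a character variable trmt that holds the value Pasture:

```sas
* Tips 1 to 4: a comment, a named dataset, indented statements, and a closing RUN;
proc print data=trial (obs=5);
  var ID trmt;
run;

/* Tip 5: the code is not case sensitive, but the data values are.   */
/* This step runs even though it is typed in capitals, but the WHERE */
/* statement only matches Pasture, not pasture or PASTURE.           */
PROC PRINT DATA=TRIAL;
  WHERE trmt = "Pasture";
RUN;
```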

Tackling an analysis using GLIMMIX

So, you have collected some data, you have a few treatments which you’d like to compare, and you want to analyze the data using Proc GLIMMIX.  So how do you start this?

My goal is to provide steps to tackle these types of analyses, whether you are working with weed data, or animal data, or yield data.  I suspect I’ll be updating this post as we clarify these steps.

First Step – your experimental design

Ah yes!  Despite popular belief you DO have an experimental design!  Find it or figure it out now before you go any further.  Why?  Because your model depends on this!  Your analysis comes down to your experimental design.

Second Step – build your MODEL statement

You know what your outcome variable is and you know what your experimental design is, which means you know which factors you’ve measured and whether they are fixed or random.  So…  you now know the basis of your MODEL statement and your initial RANDOM statement.
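As an illustration only, for a randomized complete block design with a fixed treatment effect and a random block effect, the starting point might look like the sketch below.  The dataset name first and the variable names outcome, block, and treatment are placeholders; substitute your own.

```sas
/* Placeholder names: first, outcome, block, treatment */
proc glimmix data=first;
  class block treatment;
  model outcome = treatment;   /* fixed effect  */
  random block;                /* random effect */
run;
```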

Third Step – expected distribution of your outcome variable

You already know whether your outcome variable comes from a normal distribution or not.  Chances are it is not, but what is it?  Check out the post on Non-Gaussian Distributions to get an idea of what distribution your outcome variable may be.  Think of it as the starting point.

Add this distribution and the appropriate LINK to the end of our MODEL statement.
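For example, if the outcome is a count, a Poisson distribution with a log link is one common starting point.  This is an assumed example and not a recommendation for your data; the dataset and variable names are placeholders:

```sas
/* Placeholder names: first, outcome, block, treatment */
proc glimmix data=first;
  class block treatment;
  model outcome = treatment / dist=poisson link=log;
  random block;
run;
```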

Fourth Step – run model and check residuals

Remember that when we run the Proc GLIMMIX – we need to check our assumptions – the residuals!  How do they look?  How’s the variation between your fixed effect levels?  Homogeneous or not?  Are the residuals evenly distributed?  Are the residuals normally distributed?
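One way to look at the residuals is to ask GLIMMIX for its residual plots.  The sketch below uses placeholder dataset and variable names and requests a panel of studentized residual plots, along with box plots of the residuals for the classification effects:

```sas
ods graphics on;

/* Placeholder names: first, outcome, block, treatment */
proc glimmix data=first plots=(studentpanel boxplot);
  class block treatment;
  model outcome = treatment;
  random block;
run;
```

The panel helps with the questions about the shape of the residuals, and the box plots help when comparing the variation between treatment levels.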

Fifth Step – residuals NOT normally distributed

Is there another LINK for the DISTribution that you selected?  If so, please try it.

Sixth Step – fixed treatment effects not homogeneous

Now the fun begins.  To fix this one, we need to add a second RANDOM statement – essentially telling SAS that we need it to use the variation of the individual treatment levels rather than the residual variation.  As an example, a RANDOM statement, for a design that has a random block effect, would be as follows:

RANDOM _residual_ / subject = block*treatment group=treatment;
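Placed in a complete procedure, this might look like the following sketch.  The dataset name first and the variable name outcome are placeholders, and the design is assumed to be a randomized complete block design:

```sas
/* Placeholder names: first, outcome, block, treatment */
proc glimmix data=first;
  class block treatment;
  model outcome = treatment;
  random block;
  /* Estimate a separate residual variance for each treatment level */
  random _residual_ / subject=block*treatment group=treatment;
run;
```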

Seventh Step – try another distribution

Now – we do NOT want you trying ALL the distributions possible – this just doesn’t make sense.  Remember you need to think back to the distribution possibilities for our outcome variable.  Please use the link provided in Step 3 as a guide.  However, one distribution I have discovered works for many situations is the lognormal distribution.  At the end of your model statement you would add / DIST=lognormal LINK=identity.

Another option is to transform the data in the GLIMMIX procedure.  One transformation that researchers like is the arcsine square root transformation, which applies to outcomes that are proportions between 0 and 1.  To try this one please use the following code.

Proc GLIMMIX data=first;
trans = arsin(sqrt(outcome));

model trans = …;

Run;

Last Step – results will not always be perfect!

You will do the best that you can when analyzing your data.  But please recognize that you may not be able to match all the assumptions every time.  Go back, review your data, review your experimental design, to ensure you have the correct proc GLIMMIX coding.

As I’ve noted earlier, as we continue to learn more about GLIMMIX this post will probably be updated to include and/or refine these steps.

S17 SAS Workshop: Proc GLM, Proc MIXED, Proc GLIMMIX – an overview – RCBD

Notes For the CRD and RCBD Workshop – PDF file

This workshop will look at a Randomized Complete Block Design (RCBD) in Proc GLM, Proc MIXED, and Proc GLIMMIX.  The goal is to review the coding similarities & differences, along with the differences & similarities in the respective outputs.

The SAS program can be found here – please note that it is a PDF file

Proc GLM Results

Proc MIXED Results

Proc GLIMMIX Results