Agricultural Statistics

ARCHIVE: W18 SAS Workshop: Merging datasets and creating new variables

PDF copy of the SAS code used in the workshop

This workshop will walk through merging datasets and creating new variables in SAS. You can choose to do all these steps in Excel and many people do. But sometimes it can be easier to do it all in SAS. Use Excel to enter your data, but then let SAS do the data manipulations and analysis. One of SAS’ strengths is in data manipulation!

Let’s start by bringing the data into SAS. Here is a link to an Excel file called Wksp2_data. Download the file and save it on your computer. There are 4 worksheets in this file. I will leave it up to you as to how you bring the data into SAS, but you will need to bring all 4 sheets in as 4 different SAS datasets. Let’s call the January worksheet – January, the February one February, and so on.

End goal: 4 SAS datasets: January, February, March, April

Review your LOG window after you bring each one in to make sure you haven’t missed anything. Then run a Proc Print to ensure the data looks like you were expecting it – in other words, it should look like the Excel worksheets.

Best Practice Note: Add a comment line before each Proc to describe what you’re doing! This way you’ll remember when you go back to review your syntax.

Merging Datasets

When you think about merging datasets there are 2 ways that you would want to merge. Across or down.

Across or adding Variables

We have 3 months of data where the IDs are the same: January, February, and March. We want to add the weights taken from February and March and add them to the end of the January file. So we’re adding variables in this case.

To accomplish this we need to take 2 steps. The first is to sort each dataset to make sure they are in the same order.

Proc sort data=January;
by ID;
Run;

Proc sort data=February;
by ID;
Run;

Proc sort data=March;
by ID;
Run;

Step Two:

Create a new Dataset for the merged data. Remember that the Data statement saves the data using the name you give it – so let’s call it jan_mar – for January to March.

Data jan_mar;

We’re then going to tell SAS that we want to merge the 3 datasets and we want to merge them by ID.

Data jan_mar;
merge january february march;
by ID;
Run;

What does your LOG say? Is it right? What is the best way to make sure SAS has done what you wanted it to do?

What happened? How should we fix this?

We had 3 files that had the same variable names in each. So merging them, by adding variables didn’t really work, because we were not adding new variables, but we were replacing the contents of the variables month and weight. To fix this, we need to call the variables something different in each of the months. I will add the _jan, _feb, _mar, and _apr to the weight variable in dataset. I will also change the variable month month_1, month_2, etc…

Make these changes and rerun. Did it work this time?

You should now see where there are missing data too. Something that was not apparent the first time we ran this.

Adding Observations or Down

We have a 4th dataset that contains weight measurements taken in January and April, but you’ll notice that these belong to individuals who were not included in the first 3 data files of the trial. So we will need to add these to the bottom of the dataset currently called jan_mar.

We already have a file called April that contains the April data and now we have one called jan_mar that contains the merged data from January to March. Since these individuals were not included in the original data, we do not need to sort them, since ID plays no role here.

To add the data to our merged dataset we use the SET command rather than the MERGE command:

Data jan_apr;
set jan_mar april;
Run;

Proc print data=jan_apr;
Run;

Creating New Variables

We now have all the weight data for the individuals in our trial and would like to calculate the individual weight gains from January to March. As noted before we can calculate these in Excel, but let’s use SAS to do it, especially since we have a new merged dataset.

In order to calculate the weight gain, we will be touching the data, and whenever we touch the data, we need to work within a DATA step. So let’s create a new dataset and call it jan_apr_wtgain.

Since we will be using a dataset that is already available to us in SAS we use the SET command again to recall the jan_apr dataset.

Data jan_apr_wtgain;
set jan_apr;

Now we can create our new variable. Let’s call it wtgain and it will be the difference between the weight taken in January (weight_jan) and the weight taken in March (weight_mar).

Data jan_apr_wtgain;
set jan_apr;

wtgain = weight_mar – weight_jan;
Run;

Run a Proc Print to check your data. Did it work?

Recoding a variable

Sometimes we have a variable that we want to recode – so in our case we are going to create a new variable called wtclass that will take the weights measured in January and put them into 3 weight classes: 1 = 13-16; 2 = 17-20; 3 = 21-24

There are a number of different ways to accomplish this, I will post one here and depending on time in the workshop, I will show you others. I will also include the other options in the accompanying SAS syntax – to be posted after the workshop.

We are working with data again, so need to work within a DATA step. Let’s use the jan_apr_wtgain dataset.

Data jan_apr_wtgain;
set jan_apr;

wtgain = weight_mar – weight_jan;

if weight_jan < 17 then wtclass = 1;
if weight_jan ge 17 and weight_jan < 21 then wtclass = 2;
if weight_jan >20 then wtclass = 3;
Run;

Run a Proc Print and see what happened. Are we happy with these results?

What can SAS do?

There are so many more manipulations that SAS can do. These are just a couple of them and ones that may help you out as you start using SAS for your own research data. Check out others that have been discussed in the past on the SASsyFridays blog

Crimes of Statistics: Longitudinal Studies or Repeated Measures – What are the implications?

What is a longitudinal or repeated measures study?

Let’s take a little step back first and recall the conversation we had back in the Fall semester – experimental unit – the unit to which the treatment is applied to. This is a VERY crucial concept and definition when we talk about a repeated measures study.

Like the term says repeated measures, the researcher is taking the same measurements on “some unit” repeatedly. We often think of this in terms of time. I’m going to take weight measures or height measures every month during the summer growing period. The question that needs to be answered is “What” unit? Is it the same experimental unit? If yes, then we have a classic repeated measures study. If no, then we have reps.

Longitudinal study is a term often used in the social sciences. We tend to think of a longitudinal study – again in terms of time – and usually in the context of a longitudinal survey. The experimental units, in this case, are the survey respondents, and they will be answering the same survey several times in a year or across many years.

Bottonline, a longitudinal or repeated measures study is a study where the experimental unit is measured more than once.

Examples of longitudinal or repeated measures study

An educational survey where students answer the survey after high school, after their 1st year of University, 2nd year, 3rd year, and after graduating
A dairy lactation study, where the same cows in a herd are milked and measured each day during their first 3 lactations.
A new diet trial, where feed consumed by dogs is measured every day for a 21 day trial
A new herbicide trial, where plots in a field are measured every week for weed counts
A soil texture trial, where texture is measured at 4 depths of a soil core.

Challenges with a longitudinal or repeated measures study

The goal of many studies is to examine or determine whether differences exist between treatments of interest in the study. We gather our data and conduct the statistical analysis to look at the variation between our treatments in that study. When we enter our data, chances are we will enter an observation every time we take a measurement. For example, if we have 20 dogs on our trial and we are measuring their feed intake for 21 days, we will have 420 lines of data. OR we may have a dataset that has 20 lines, with each line containing 21 measures for each dog. Either way, we have 21 measurements for each experimental unit. The big challenge of a repeated measures analysis is to recognize that the variation within the experimental unit, dog in this example, needs to be accounted for, before looking at the differences between the treatments, diets in this case.

If I use my data cloud visual to try and explain. We have 420 measures in our experiment – let’s throw this data up and think of it as a big cloud of data. With our analysis, the goal is to partition that cloud into the treatment groups and hopefully be able to see distinct treatment groups. However, we have 21 measurements for each dog and we want to ensure that when we start to look at treatments effects, that we keep those 21 measures for the dog together as a unit. Remember we only have 20 experimental units and that’s where we should be concentrating when we look for treatment effects. We do NOT have 420 experimental units!

No matter what statistical software package you use, there will be options to identify your experimental unit! You need to find it.

Can you think of trials or studies that you have done in the past or will be doing in the future, is it a longitudinal or repeated measures study?

Questions to ask to help you determine if you have a longitudinal or repeated measures study:

What is your experimental unit?
Is your experimental unit being measured more than once?

Name

Change in starting time!

Good day everyone!

Please note that the Crimes in Statistics and Data Visualization sessions will now start at 9am on Wednesdays. We will meet in ANNU Rm 101, but start 30 minutes earlier.

For more information on the dates, please review the schedule.

SPSS: Descriptive Statistics and Charts

Last week in the RPD6380 class, we talked about how to enter data into an Excel spreadsheet, and then opening it into the SPSS software.

As a recap:

Create variable names at the top of each column in Excel to match your variables/questions. Use the Best Practices to naming your variables:
- Keep it short (Maximum 32 characters in SPSS)
- Start with a letter – can contain numbers
- NO funny characters – %,$,#, etc…
- NO blank spaces – use an _ if you want
Save your file in Excel
To open in SPSS
- File
- Open Data
- Navigate to where you saved your file
- Change File Type to Excel
- Select your file and click Open
- Answer the questions in the dialogue box
Make sure you save your data in SPSS

Variable Labels

In the Variable View of SPSS, take the time to fill in the Labels for each variable. This way you won’t have to remember what those shortened variable names are in a couple of months or years.

Value Labels

We worked through an exercise where we coded some of our data. Males/Females were m/f or 1/2. Be sure to add all these labels in the Values section of the Variable View in SPSS. The time you spend doing this at the start of your research will save you a LOT of time when you do your analysis.

Missing Values

Be sure to add any missing codes to the Missing column in the Variable View

Descriptive Statistics

At the beginning of any statistical analysis, learning more about your data is a great place to start. Descriptive statistics are essentially that – they describe your data, or they summarize your data to give you a good, solid base understanding of what you have collected. The type of descriptive statistics you will conduct will depend on the type of variable you have. Remember the 3 types of variables that SPSS distinguishes between?

Scale – a continuous piece of information, also referred to as Interval or Ratio. Examples: age, weight, height
Nominal – a categorical piece of data – there is NO relationship between the categories. Examples: religion, colour, gender
Ordinal – a categorical piece of data – this time there is a relationship or order to the categories. Examples: Year of study, age group, likert scales

Each of these data types will use a different type of descriptive statistic. For instance, calculating the mean of colour makes no sense at all, but a frequency count of colour does work.

Frequency

To calculate the frequency of a categorical variable (nominal OR ordinal) in SPSS:

Analyze
Descriptive Statistics
Frequencies
- Select the variables in question and drag to the right hand side
  - As an example, select Income Category
- Click OK to run

You should now have a frequency table of the variable, Income Category

The lists the categories of the variable, in this case: Below $25; $25-$49; $50-$74; $75+. If you had not provided the value labels, you would see 1; 2; 3; 4 as the categories with no explanation as to what they represent.

The table lists Frequency – actual count of observation in each category; Percent – percent of observations as a total; Valid Percent – this will change if you have missing observations. The Valid Percent is the percentage of observations that have values for Income Category; Cumulative Percent.

Try:

Run the Frequency procedure on the variable called Internet
Can you describe what you see?

Mode

Mode is the value in the data that appears the most. So let’s switch variables and run a frequency on the variable Job Satisfaction. When you run the frequency you have a table that shows you how many people answered each of the 5 levels of this Likert Scale:

Highly dissatisfied = 1109
Somewhat dissatisfied = 1268
Neutral = 1393
Somewhat satisfied = 1406
Highly satisfied =1224

By looking at these results I can see that Somewhat satisfied appears to be the category that people selected the most. But let’s get SPSS to do the hard work for us and confirm whether this is correct or not.

To obtain the MODE of a variable:

Analyze
Descriptive Statistics
Frequencies
- Select the variables in question and drag to the right hand side
  - As an example, select Job Satisfaction
- Click on the Statistics button on the right
  - Select Mode
  - Click Continue
  - Click OK

You should now see the Mode in the first table of the Frequency output.

Try:

What is the mode?
Would you calculate the MODE on a variable such as income? Why or Why not?

Median

The median of a variable, is the middle value. So if you have an even number of categories, there will be no median or middle value, but if you have an odd number you will see it.

To obtain the MEDIAN in SPSS, follow the same instructions as the MODE, but select the MEDIAN in the Statistics dialogue box.

Try:

What is the median for Job Satisfaction?
What is the median value for Level of Education?

Mean

The mean or average is calculated on a scale variable or continuous variable. It just doesn’t make sense to calculate the mean of a categorical variable.

To obtain the MEAN in SPSS:

Analyze
Descriptive Statistics
Descriptives
- Select the variable in question and drag to the right hand side
  - use income as an example
  - Click OK to run

You should now have a table with N, Minimum, Maximum, Mean, and Standard Deviation for the household income variable. These are the default values you obtain when you run this analysis. But, what happens if you want the Sum or the Standard Error of this variable?

Analyze
Descriptive Statistics
Descriptives
- Select the variable in question and drag to the right hand side
- Select the Options button – this will open another dialogue box that has a list of statistics to select from
  - Select Sum and S.E. mean (standard error of the mean)
- Click Continue
- Click OK to run

Your output table will now contain these added statistics.

Try:

Select another Scale variable from your dataset and calculate the mean, variance, and standard deviation.

Explore Function in SPSS

Sometimes you may want to determine what the mean household income by marital status or by another categorical variable. Till now, we’ve been looking at the entire dataset. There are a few ways to do this, but the most direct way is to use the Explore function in SPSS.

Analyze
Descriptive Statistics
Explore
- In the Dependent List box, add the variables for which you would like to calculate the means – for example: household income
- In the Factor List box, add the variable by which you would like to see the means for – for example: marital status
- Click Ok to run.

You will now see a much larger table than we have seen to date. SPSS provides you with a long list of descriptive statistics for household income by each level of marital status.

You will also see a Stem and Leaf plot along with a Boxplot to provide you with a sense of the distribution of the data. More information to help you get a better feeling for the data that you are working with.

Summary

The common descriptive statistics that are used include: frequency, median, mode, mean, and measures of variation (standard deviation, standard error, etc..). Each of these statistics should be run on the appropriate types of data – keep in mind, that a frequency on a variable such as age will give you a long table with meaningless information.

Chart Builder in SPSS

Numbers and statistics can be fun, but sometimes putting these numbers into context with a chart or graph may reach a broader audience of understanding. What do I mean by that? How many of you will remember a number vs how many of you will remember a graph that shows a trend?

Building charts in SPSS is quite straightforward and fun! You’ll see!!

Let’s start by creating a barchart for our job satisfaction variable. We want to see a bar for each level and we want to see the count.

In SPSS:

Graphs
Chart Builder – this will open a dialogue box
- Notice on bottom half – a gallery of all the different types of charts you can create in SPSS.
- We want a simple barchart
  - Select bar
  - Then double-click on the first barchart listed
  - Once you do this you should see the skeleton of a bar chart appear in the top half of your dialogue box.
  - All you need to do now, is to drag and drop the variables where they are appropriate.
  - For this example:
    - Select Job satifisfaction and drag it to the x-axis
    - On the right, you may see an Element Properties dialogue box (if you do not see this – Click on the Element Properties button to open it).
    - Note that under Statistics, Count is selected – this is what we want. But click on this to see what other statistics are available.
  - To create the graph Click OK

You should now see a very plain barchart that matches the Frequency counts we created earlier.

Let’s create a chart that shows the average income for each level of job satisfaction. I’m curious to see whether the folks that are not satisfied with their job have a lower average income.

So, let’s start this again:

Graphs
Chart Builder – this will open a dialogue box
- Select Barchart again
- Drag and drop Job Satisfaction to the x-axis
- Now drag and drop Household income to the y-axis
- Notice how the Statistic changed to Mean. This is what we want.
- Let’s run in by clicking OK

Hmm… now that’s an interesting graph!

One last piece missing from this graph – error bars! Whenever you have charts with means, you should ALWAYS provide some measure of variance. So let’s add some error bars and we’ll try standard error.

Graphs
Chart Builder – this will open a dialogue box
- Select Barchart again
- Drag and drop Job Satisfaction to the x-axis
- Now drag and drop Household income to the y-axis
- Ensure that the statistic is mean
- Under the statistics box in the Element Properties box, check the Display Error Bars box
  - Now you have a few options, as stated above let’s use the Standard Error option – select Standard Error
  - Click Apply
- Click OK to run chart

Providing the error bars gives the reader a “fuller” picture of the data. Although in this case it does not change the story!

Try:

Create a barchart that shows the mean household income by job satisfaction for the 2 levels of marital status. Be sure to include error bars.
What question does this barchart answer?

More charts

I used the example of a barchart, but the more you use the ChartBuilder, you can see how straightforward it is to create charts in SPSS. Try playing around with a different chart and see what happens.

Summary

Barchart for counts
Barchart to show means of groups
Side-by-side barchart to show means of group

Name

ARCHIVE: W18 RDM Workshop: Store and Analyze

This workshop is the second in a series of 4 offered in partnership with Carol Perry, Associate Librarian Research and Scholarship. These workshops are hands-on and have exercises associated with each aspect being covered in the workshop.

This workshop explores the areas of data storage and data analysis in the context of Research Data Management. The powerpoint presentation is available here, please review for more information and contact either Carol Perry or Michelle Edwards for questions.

Name