ARCHIVE: W18 SPSS Workshop: Merging datasets and creating new variables

This workshop will walk you through the steps in SPSS to help you merge datasets, create new variables, and recode variables in your current dataset.  Many of these functions are easily performed in Excel, but I’d like to show you how to take advantage of some of the data manipulation options available to you in SPSS.  Pick the method that is the most comfortable for you, when working with your data.

Let’s start by downloading a dataset to use for this workshop, a dummy dataset I created in Excel.  Download the file and save it on your computer.  There are 4 worksheets in this file.  Weight measures were taken on 25 individuals in January, February, and March.  For the month of April we have an additional individuals with weight measures taken in January and April.  Our goal is to create one SPSS dataset that contains all 4 Excel worksheets.

First, let’s 0pen each worksheet in SPSS and save them using the dataset names of January, February, March, and April.  Remember that although the Excel file can hold several worksheets with different data, statistical packages such as SPSS and SAS, cannot bring in the entire file at the same time, you will need to bring in each worksheet separately.

You should now have 4 SPSS datasets saved on your computer called:  January.sav, February.sav, March.sav, and April.sav

Merging Datasets

There are 2 ways to merge datasets in SPSS:  Add Cases or Add Variables.  I really like how SPSS states these, since it makes it clear as to how you are adding the data from one file to the next.  Add Cases – means that you will be adding more observations to the current dataset.  If you think about this, this means that you would be adding more respondents, more animals, more plots, etc…  Essentially adding more observations that are unique to what is in your current dataset.

Add Variables means you have the same set of observational units, so respondents, animals, plots, etc… and you have new measurements that you want to add to your current dataset.

Adding Variables

Let’s start with our datasets – adding variables.  We have 2 files January, February, and March that have the same 25 individuals and weight measures taken in 3 months.  We want all of these to be in the same SPSS dataset so we can do an analysis across these 3 months.

What’s the first thing we need to do before adding February data to January?  If we were to try it as it is – but could happen?

HINT:  look at the variable names for the 2 files

Make the appropriate changes in the Variable View of SPSS.

Before we can merge any files, we need to ensure that all the datasets are sorted.  It looks like they are currently sorted, but let’s double-check by getting SPSS to run a sort anyway.  Sort all the datasets so we are all set for the next steps.

  • Data
  • Sort Cases
    • Put ID into the Sort by: Box
    • Sort Order – Ascending
  • OK

Let’s start with adding February data to January:

Make sure you are in the January file

  • Data
  • Merge Files
    • Add Variables
    • Since all of our datasets are open – you should see a list of the currently open files.  Select February.  If you closed all the datasets, the top box will be empty, then you will select External SPSS dataset and navigate to where you saved your file and open it
    • Continue
      • Your new dialogue box – you should see all of the variables available in both datasets.  * is the active dataset (January), + is the DataSet 2 or new combined dataset.
      • Review what variables are where.  New Active Dataset is the new merged dataset – think about what you want to see included in here.  Does it match your expectations?  Notice the list of variables in the Excluded box – do these make sense?  Notice that these are the ID and TRMT – variables which were duplicated in the 2 datasets.
      • We can add a key variable – the variable on which your variables is sorted by.  Let’s do this to make sure weights are matched to the appropriate ID.
      • Select Match cases on key variables
        • Select that the variables are sorted in both files – Select both files provide cases – Then select ID from the Excluded box and add it to the Key Variables box – notice how it disappears from the  new Active Dataset box too
  • Ok to run  – you will get a warning message – click ok

Take a look at the January file.  Does it look correct?  Remember you can double-check by looking at the February file.  If you are happy with the way it looked, save it under a new name – maybe Jan_Feb

Repeat this process to add the March dataset as well and save it as a file called Jan_Mar.sav

Close the February and March files, leaving the new Jan_Mar and April datasets open.

Adding Cases

So now we have a dataset with weight measures taken from January to March on the same individuals.  Now we need to add the new observations that have weight measures taken in January and April – this is a great example of Adding Cases to a dataset.

With the Jan_Mar dataset open:

  • Data
  • Merge Files
    • Add Cases
    • As this first case above, because our April dataset is open, you will see it listed in the dialogue box
    • Select April and Continue
      • You should now see a box that lists variables that are Paired and Unpaired.   Essentially think of these as the list of variables that are unique and present in both files
      • Since we want ALL the variables in our new dataset, select all the unpaired Variables and put into the Variables in the New Active Dataset – our new dataset
      • OK

Review the dataset, is it what you were expecting?

If it is, save it as Jan_Apr.sav.  Close the April dataset

Creating New Variables

Adding a new variable

Let’s start by creating a new variable called Wtgain that is the difference between the weight measured in January and the weight measured in March.

  • Transform
  • Compute Variable
    • This will open a new dialogue box
    • In the Target Variable enter a new variable name: wtgain
    • In the Numeric Expression: enter the calculation – select the Weight_mar and place in this box – then add the “-” following by the Weight_jan variable.  So it should read  weight_mar – weight_jan
    • OK

SPSS will add the new variable at the end of the current dataset.  Review and decide whether is completed the action you were expecting.  Since the original weight measures had no decimal places, let’s remove the decimal places added to the new variable.  Add a label to this new variable.

Save the dataset.

Recoding a variable

Sometimes we have a variable that we want to recode – so in our case we are going to create a new variable called wtclass_jan that will take the weights measured in January and put them into 3 weight classes:  1 = 13-16; 2 = 17-20;  3 = 21-24

SPSS has 2 functions that allow us to recode variables.  One is called, Recode into Same Variables… and the second, Recode into Different Variables…  In the interest of NOT writing over any data, I recommend that you use the Recode Into Different Variables… option.  As you work more and more with your data in SPSS, there may come a time when you may want to use the Recode Into Same Variables.. option, but understand that you will lose any data that you overwrite.

We want to create the new variable called wtclass_jan as described above.  To do this:

  • Transform
  • Recode Into Different Variables…
    • You will now see a dialogue box.  Select the Weight_jan variable from the left box and throw it over to the Input Variable ->Output Variable box
    • You will now provide a new name wtclass_jan in the Name box under the Output Variable section.
    • Add an appropriate label – maybe something like January Weight Class Group
      • Change
      • you should now see weight_jan -> wtclass_jan in the middle box
    • Now select Old and New Variables button
      • This is where you will tell SPSS that wtclass 1 contains any weight measures taken in January that include 13 – 16
      • There are different ways to do this – select the one that works best for you
      • Old value – will be the values in weight_jan – I will use Range – 13 – 16
      • New value – will be 1
      • Click Add
      • Create the next 2 groups
    • Continue
  • OK

Remember SPSS creates any new variable and adds it to the end of the current dataset.  We have a small dataset, so it is easy to find, but when you start to work with your own data, you may have 100s of variables – so remember that the new variables are added to the end.

Check the new wtclass_jan variable to see if it worked.

Save your dataset

What can SPSS do?

There are many data manipulations that can be performed in SPSS.  Whether you do these in Excel or SPSS it does not matter.  Document any changes you made to remember what you did.

Name

 

 

 

Data Visualization: Qualitative Data

Last time we met we took a closer look at bar charts and how we can create bar charts in the different software packages we have access to at the University of Guelph.  This time I’d like to change gears a bit and take a look at qualitative data and some of the options that are available to visualize this type of data.

First, qualitative data is a broad term, and the more I work in the field of data & statistics, the more I learn how people define and use the term “qualitative data”.  So, I am not going to try to define this term for this session, but rather concentrate on how we can use different types of visualization for this broadly defined type of data.

Resources, there are a large number of resources available to you to help you determine what the best option for your visualization is.  I will highlight a couple that I’m currently using.  If you are using others, please let me know and I can add them to the list of resources on this site.

Chart Suggestions – Chart suggestions presented by Andrew Abela from Extreme Presentations.  A decision chart based on the types of data you are using and the story you want to tell or visualize.

Qualitative Chart Chooser 3.0 by Jennifer Lyons and Stephanie Evergreen.  I just came across this wonderful resource.  I really like their approach of “What story are you trying to tell?”  Let’s work through a couple of examples and discuss your thoughts on this resource.

Books:  Data Visualization, A Handbook for Data Driven Design.  Andy Kirk  (2016).

As you peruse these resources, the one thing that you will notice is that you have to be very comfortable with the type of data you are working with.  A quick review:

  • Categorical:
    • Nominal – groups or levels that do you have any relationship between them or any order
      • Examples may include:  gender, religion, Yes/No
    • Ordinal – groups or levels that have an order to them
      • Examples may include: level of education, size, Likert scales
  • Continuous, Scale, Interval, Ratio
    • Data that was collected on a scale, interval, or ratio
      • Examples may include:  age, weight, temperature, weight gain

Bring examples of qualitative data to this session and we will look at possible options to visualize the data, the pros, and the cons.

Name

ARCHIVE: W18 RDM Workshop: Secure and Preserve

This workshop is the third in a series of 4 offered in partnership with Carol Perry, Associate Librarian Research and Scholarship.  These workshops are hands-on and have exercises associated with each aspect being covered in the workshop.

This workshop explores the areas of data anonymisation (making data secure) and data preservation in the context of Research Data Management.  The powerpoint presentation is available here, please review for more information and contact either Carol Perry or Michelle Edwards for questions.

Name

ARCHIVE: W18 SAS Workshop: Merging datasets and creating new variables

PDF copy of the SAS code used in the workshop

This workshop will walk through merging datasets and creating new variables in SAS.  You can choose to do all these steps in Excel and many people do.  But sometimes it can be easier to do it all in SAS.  Use Excel to enter your data, but then let SAS do the data manipulations and analysis.  One of SAS’ strengths is in data manipulation!

Let’s start by bringing the data into SAS.  Here is a link to an Excel file called Wksp2_data.  Download the file and save it on your computer.  There are 4 worksheets in this file.  I will leave it up to you as to how you bring the data into SAS, but you will need to bring all 4 sheets in as 4 different SAS datasets.  Let’s call the January worksheet – January, the February one February, and so on.

End goal:  4 SAS datasets:  January, February, March, April

Review your LOG window after you bring each one in to make sure you haven’t missed anything.  Then run a Proc Print to ensure the data looks like you were expecting it – in other words, it should look like the Excel worksheets.

Best Practice Note:  Add a comment line before each Proc to describe what you’re doing!  This way you’ll remember when you go back to review your syntax.

Merging Datasets

When you think about merging datasets there are 2 ways that you would want to merge.  Across or down.

Across or adding Variables

We have 3 months of data where the IDs are the same: January, February, and March.  We want to add the weights taken from February and March and add them to the end of the January file.  So we’re adding variables in this case.

To accomplish this we need to take 2 steps.  The first is to sort each dataset to make sure they are in the same order.

Proc sort data=January;
  by ID;
Run;

Proc sort data=February;
  by ID;
Run;

Proc sort data=March;
  by ID;
Run;

Step Two:

Create a new Dataset for the merged data.  Remember that the Data statement saves the data using the name you give it – so let’s call it jan_mar – for January to March.

Data jan_mar;

We’re then going to tell SAS that we want to merge the 3 datasets and we want to merge them by ID.

Data jan_mar;
  merge january february march;
  by ID;
Run;

What does your LOG say?  Is it right?  What is the best way to make sure SAS has done what you wanted it to do?

What happened?    How should we fix this?

We had 3 files that had the same variable names in each.  So merging them, by adding variables didn’t really work, because we were not adding new variables, but we were replacing the contents of the variables month and weight.  To fix this, we need to call the variables something different in each of the months.  I will add the _jan, _feb, _mar, and _apr to the weight variable in dataset.  I will also change the variable month month_1, month_2, etc…

Make these changes and rerun.  Did it work this time?

You should now see where there are missing data too.  Something that was not apparent the first time we ran this.

Adding Observations or Down

We have a 4th dataset that contains weight measurements taken in January and April, but you’ll notice that these belong to individuals who were not included in the first 3 data files of the trial.  So we will need to add these to the bottom of the dataset currently called jan_mar.

We already have a file called April that contains the April data and now we have one called jan_mar that contains the merged data from January to March.  Since these individuals were not included in the original data, we do not need to sort them, since ID plays no role here.

To add the data to our merged dataset we use the SET command rather than the MERGE command:

Data jan_apr;
  set jan_mar april;
Run;

Proc print data=jan_apr;
Run;

Creating New Variables

We now have all the weight data for the individuals in our trial and would like to calculate the individual weight gains from January to March.  As noted before we can calculate these in Excel, but let’s use SAS to do it, especially since we have a new merged dataset.

In order to calculate the weight gain, we will be touching the data, and whenever we touch the data, we need to work within a DATA step.  So let’s create a new dataset and call it jan_apr_wtgain.

 

Since we will be using a dataset that is already available to us in SAS we use the SET command again to recall the jan_apr dataset.

Data jan_apr_wtgain;
  set jan_apr;

Now we can create our new variable.  Let’s call it wtgain and it will be the difference between the weight taken in January (weight_jan) and the weight taken in March (weight_mar).

Data jan_apr_wtgain;
  set jan_apr;

  wtgain = weight_mar – weight_jan;
Run;

Run a Proc Print to check your data.  Did it work?

Recoding a variable

Sometimes we have a variable that we want to recode – so in our case we are going to create a new variable called wtclass that will take the weights measured in January and put them into 3 weight classes:  1 = 13-16; 2 = 17-20;  3 = 21-24

There are a number of different ways to accomplish this, I will post one here and depending on time in the workshop, I will show you others.  I will also include the other options in the accompanying SAS syntax – to be posted after the workshop.

We are working with data again, so need to work within a DATA step.  Let’s use the jan_apr_wtgain dataset.

Data jan_apr_wtgain;
  set jan_apr;

  wtgain = weight_mar – weight_jan;

  if weight_jan < 17 then wtclass = 1;
  if weight_jan ge 17 and weight_jan < 21 then wtclass = 2;
  if weight_jan >20 then wtclass = 3;
Run;

Run a Proc Print and see what happened.  Are we happy with these results?

What can SAS do?

There are so many more manipulations that SAS can do.  These are just a couple of them and ones that may help you out as you start using SAS for your own research data.  Check out others that have been discussed in the past on the SASsyFridays blog