Continuous NHANES Web Tutorial: Append & Merge Datasets: Task 2c

Task 2c: How to Merge NHANES Data in Stata

Here are the steps to appending NHANES data in Stata:

Step 1: Sort data files by SEQN

The first step in merging data is to sort each of the data files by a unique identifier. In NHANES data, this unique identifier is known as the sequence number (SEQN). NHANES uses SEQN to identify each sample person, so SEQN is the variable you must use to merge data files. To ensure that all observations are ordered in the same way in each data file, you need to sort each data file by the SEQN variable.

Use the sort command to sort the data and the clear option to replace data already in memory. The save command with the replace option replaces the unsorted dataset with the sorted dataset. The general format of the command is below.

use filename, [clear]
sort varname
save filename, [replace]

Sort the 4-year blood pressure examination data by SEQN.

use C:\Nhanes\Data\bpx_4yr, clear
sort seqn
save C:\Nhanes\Data\bpx_4yr, replace

Sort the 4-year blood pressure questionnaire data by SEQN.

use C:\Nhanes\Data\bpq_4yr, clear
sort seqn
save C:\Nhanes\Data\bpq_4yr, replace

Sort the 4-year medical conditions questionnaire data by SEQN.

use C:\Nhanes\Data\mcq_4yr, clear
sort seqn
save C:\Nhanes\Data\mcq_4yr, replace

Sort the 4-year laboratory data by SEQN.

use C:\Nhanes\Data\lab13_4yr, clear
sort seqn
save C:\Nhanes\Data\lab13_4yr, replace

Sort the 4-year demographic data by SEQN.

use C:\Nhanes\Data\demo_4yr, clear
sort seqn
save C:\Nhanes\Data\demo_4yr, replace

Step 2: Merge data by SEQN

After sorting the data files, you can continue merging the data to the master file (demo_4yr) using the merge command. Remember that you will always merge on a unique identifier in this case, the SEQN variable. The general form of the merge command is as follows:

Merge [varlist] using filename [, options]

use "c:\NHANES\Data\Demo_4yr", clear
merge seqn using C:\Nhanes\Data\bpx_4yr C:\Nhanes\Data\bpq_4yr C:\Nhanes\Data\mcq_4yr C:\Nhanes\Data\lab13_4yr
save C:\Nhanes\Data\demo_bp, replace

IMPORTANT NOTE

Note that the datasets in the using list could have been sorted by using the option unique sort at the end of the merge command. The option unique will match variables uniquely identifying observations on both data in memory and in filename. The option sort will sort the master and using datasets by match variables before the merge.

Step 3: Check the results

After you have merged the data files, it is advisable that you check the contents again to make sure that the files merged correctly. Use the describe command to list all variable names and labels. Use the tabulate command to check how the merge went. The merge command, used in the previous step, will create new variables _merge, _merge1., depending on the each dataset listed in the using list of the merge command. The variable _merge takes on one of three values: _merge=1, the observation is present on the master dataset; _merge=2, the observation is present in one of the using datasets (but not the master dataset); _merge=3, the observation is present in at least two datasets, either master or using. Finally, use the tabstat command to check the number of observations for each variable as well as minimum, and maximum values in the merged dataset (demo_bp). Use the nmissing command to list the number of missing observations.

IMPORTANT NOTE

The nmissing command can be installed from http://www.stata-journal.com/software/sj5-4/dm67_3/.

describe
tab _merge

tabstat seqn sddsrvyr ridstatr ridpreg sdmvpsu sdmvstra wtmec4yr ///
riagendr ridageyr ridreth1 bpx* bpq* mcq*, stat(n min max)
nmissing bpx* bpq* mcq*

Highlighted results of this program:

This dataset contains 21,004 observations. No new observations have been added to the dataset. Of those observations, 43 (_merge=1) were only in the master dataset (demo_4yr), while 20,961 observations (_merge=3) were in at least two datasets.
Using the variable list provided by the describe command, check that the variables in your original datasets are in the merged dataset in this case, they all are.
The N row has the number of observations with valid data for each variable in the dataset.
The minimum row lists the minimum value for each variable in the dataset.
The maximum row lists the maximum value for each variable in the dataset.
The nmissing output has the number of missing observations for each variable in the dataset.

Close Window