Continuous NHANES Web Tutorial: Append & Merge Datasets: Merging Datasets

Key Concepts About Merging Data in NHANES

Since 1999, NHANES data files have been released for public use in 2-year groupings, also known as cycles. For each data cycle, data files are organized by their collection method, which can fall under one of four components: Demographic, Examination, Laboratory, and Questionnaire. To allow for timelier releases, different components of the data files are usually released at different times as they are completed. Putting the components of these data files together in a dataset is called merging.

The first step in merging data is to sort each data file by a unique identifier. In NHANES, this identifier is the sequence number (SEQN). NHANES uses SEQN to identify each sample person, so SEQN is the variable you must use to merge data files. Sorting every file by SEQN ensures that observations are ordered the same way in each file. Use proc sort in SAS or the sort command in Stata to sort the data. After sorting the data files, you can proceed with the merge.
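The sort-then-merge logic described above can be sketched in a language-neutral way. The example below is a minimal illustration in plain Python, not the SAS or Stata syntax. The two small files and their values are made up, RIAGENDR and BPXSY1 stand in for typical demographic and examination variables, and the function name merge_by_seqn is hypothetical. It assumes SEQN appears at most once in each file.

```python
# Made-up miniature files; real NHANES files hold many more rows and columns.
demo = [
    {"SEQN": 3, "RIAGENDR": 2},
    {"SEQN": 1, "RIAGENDR": 1},
    {"SEQN": 2, "RIAGENDR": 2},
]
bpx = [
    {"SEQN": 2, "BPXSY1": 118},
    {"SEQN": 1, "BPXSY1": 124},
]


def merge_by_seqn(left, right):
    """Sort both files by SEQN, then walk them in step and combine rows.

    Rows found in only one file are kept, so nobody is dropped.
    Assumes SEQN is unique within each file (a one-to-one merge).
    """
    left = sorted(left, key=lambda row: row["SEQN"])
    right = sorted(right, key=lambda row: row["SEQN"])
    merged = []
    i = j = 0
    while i < len(left) or j < len(right):
        if j >= len(right) or (i < len(left) and left[i]["SEQN"] < right[j]["SEQN"]):
            merged.append(dict(left[i]))  # SEQN only in the left file
            i += 1
        elif i >= len(left) or right[j]["SEQN"] < left[i]["SEQN"]:
            merged.append(dict(right[j]))  # SEQN only in the right file
            j += 1
        else:
            merged.append({**left[i], **right[j]})  # same SEQN in both files
            i += 1
            j += 1
    return merged


merged = merge_by_seqn(demo, bpx)
for row in merged:
    print(row)
```

Because both files are in SEQN order, each row is visited once and rows are paired only when their SEQN values match. This is why sorting comes first.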

After you have merged the data files, check the contents again to make sure that the files merged correctly. In SAS, use the proc contents procedure to list all variable names and labels, and use the proc means procedure to check the number of observations for each variable as well as the missing, minimum, and maximum values. In Stata, use the describe command to list all variable names and labels, and use the tabstat command to check the number of observations for each variable as well as the missing, minimum, and maximum values.

In Stata, also use the tabulate command to check the merge. The merge command creates new variables (_merge, _merge1, …), depending on the datasets listed in the using list of the merge command. The variable _merge takes on one of three values: _merge=1, the observation is present only in the master dataset; _merge=2, the observation is present in one of the using datasets (but not the master dataset); _merge=3, the observation is present in at least two datasets, either master or using.
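The post-merge checks can be sketched the same way. The example below is a rough Python illustration under stated assumptions, not Stata or SAS output. The helper names merge_with_indicator and summarize are hypothetical, and the data are made up. The _merge coding (1, 2, 3) mirrors the description above for a single master and a single using file. SEQN 4 is placed only in the using file purely to show the _merge=2 case.

```python
from collections import Counter

# Made-up data for illustration only.
master = [
    {"SEQN": 1, "RIAGENDR": 1},
    {"SEQN": 2, "RIAGENDR": 2},
    {"SEQN": 3, "RIAGENDR": 2},
]
using = [
    {"SEQN": 2, "BPXSY1": 118},
    {"SEQN": 3, "BPXSY1": 124},
    {"SEQN": 4, "BPXSY1": 130},
]


def merge_with_indicator(master, using, key="SEQN"):
    """Merge two files on key and record where each row came from."""
    master_by_key = {row[key]: row for row in master}
    using_by_key = {row[key]: row for row in using}
    merged = []
    for k in sorted(master_by_key.keys() | using_by_key.keys()):
        in_master = k in master_by_key
        in_using = k in using_by_key
        row = {**master_by_key.get(k, {}), **using_by_key.get(k, {})}
        # 1 = master only, 2 = using only, 3 = present in both
        row["_merge"] = 3 if (in_master and in_using) else (1 if in_master else 2)
        merged.append(row)
    return merged


def summarize(rows, var):
    """Count non-missing and missing values of var and report its range."""
    values = [row[var] for row in rows if row.get(var) is not None]
    return {
        "n": len(values),
        "missing": len(rows) - len(values),
        "min": min(values) if values else None,
        "max": max(values) if values else None,
    }


merged = merge_with_indicator(master, using)
merge_table = Counter(row["_merge"] for row in merged)  # like tabulating _merge
bp_summary = summarize(merged, "BPXSY1")
print(dict(merge_table))
print(bp_summary)
```

A large count of _merge values other than 3, or an unexpected number of missing values for a variable, is a signal to recheck the files before analysis.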

WARNING

The master dataset prepared in the Append and Merge Data Module contains all sample persons who completed the household interview. In other words, it includes both sample persons who were interviewed only (but not MEC examined) and those who were both interviewed and examined.

Some analysts select the study population in the SAS data step and save a SAS data file containing only the observations that meet the selection criteria, for instance, only those who completed MEC exams. The disadvantage of doing so is that you can no longer analyze the household interview items using the interview weights, such as interview questions on blood pressure by demographic variables. You also cannot examine non-response rates for the MEC items (i.e., the rate of those who were interviewed but not examined).
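One way to avoid this loss is to keep every interviewed sample person in the file and mark the study population with a flag instead of deleting rows. The sketch below illustrates the idea in Python with made-up data. It assumes the demographics variable RIDSTATR codes interview and examination status as 1 = interviewed only and 2 = interviewed and MEC examined. The flag name in_mec_subpop is hypothetical, and the rate shown is unweighted.

```python
# Made-up records; RIDSTATR coding is an assumption (1 = interviewed only,
# 2 = both interviewed and MEC examined).
persons = [
    {"SEQN": 1, "RIDSTATR": 2},
    {"SEQN": 2, "RIDSTATR": 2},
    {"SEQN": 3, "RIDSTATR": 1},
    {"SEQN": 4, "RIDSTATR": 2},
    {"SEQN": 5, "RIDSTATR": 2},
]

# Flag the MEC-examined subpopulation instead of dropping everyone else.
for person in persons:
    person["in_mec_subpop"] = person["RIDSTATR"] == 2

interviewed = len(persons)
examined = sum(person["in_mec_subpop"] for person in persons)

# Unweighted share of interviewed persons who were not examined.
mec_nonresponse_rate = (interviewed - examined) / interviewed
print(interviewed, examined, mec_nonresponse_rate)
```

With the flag in place, the full file still supports interview-level analyses, and the interviewed-only rows remain available as the numerator for non-response calculations.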