data transformation
(noun)
The application of a deterministic mathematical function to each point in a data set.
Examples of data transformation in the following topics:
-
When to Use These Tests
- In statistics, "ranking" refers to the data transformation in which numerical or ordinal values are replaced by their rank when the data are sorted.
- Data transformation refers to the application of a deterministic mathematical function to each point in a data set—that is, each data point $z_i$ is replaced with the transformed value $y_i = f(z_i)$, where $f$ is a function.
- Data can also be transformed to make them easier to visualize.
- Indicate why and how data transformation is performed and how this relates to ranked data.
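The two ideas above, applying a function $f$ to each point and replacing values by their ranks, can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names `transform` and `rank` and the sample values are hypothetical, and giving tied values the average of their ranks is one common convention assumed here.

```python
import math

def transform(data, f):
    """Replace each data point z_i with y_i = f(z_i)."""
    return [f(z) for z in data]

def rank(data):
    """Replace each value with its 1-based rank in sorted order.

    Assumption: tied values share the average of the ranks they occupy.
    """
    order = sorted(range(len(data)), key=lambda i: data[i])
    ranks = [0.0] * len(data)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next sorted value is tied with this one
        while end + 1 < len(order) and data[order[end + 1]] == data[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1  # average of 1-based positions pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = avg
        pos = end + 1
    return ranks

values = [3.4, 5.1, 2.6, 7.3]          # hypothetical data
print(transform(values, math.sqrt))    # square-root transformation
print(rank(values))                    # [2.0, 3.0, 1.0, 4.0]
```

Ranking discards the distances between values and keeps only their order, which is why it is grouped with other data transformations.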
-
Exercises
- If the arithmetic mean of log10 transformed data were 3, what would be the geometric mean?
- Using Tukey's ladder of transformation, transform the following data using a λ of 0.5: 9, 16, 25.
- In the ADHD case study, transform the data in the placebo condition (D0) with λ's of .5, 0, -.5, and -1.
- How does the skew in each of these compare to the skew in the raw data?
- Which transformation leads to the least skew?
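As a worked check on the first two exercises (a sketch, assuming base-10 logs and that the ladder applies $x^{\lambda}$ for $\lambda > 0$): back-transforming the mean of the logs, written here as $\bar{y}$, gives the geometric mean (GM), and $\lambda = 0.5$ amounts to taking square roots.

$$\text{GM} = 10^{\bar{y}} = 10^{3} = 1000$$

$$9^{0.5} = 3, \quad 16^{0.5} = 4, \quad 25^{0.5} = 5$$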
-
Transforming data (special topic)
- When data are very strongly skewed, we sometimes transform them so they are easier to model.
- A transformation is a rescaling of the data using a function.
- Transformed data are sometimes easier to work with when applying statistical models because the transformed data are much less skewed and outliers are usually less extreme.
- While there is a positive association in each plot, the transformed data show a steadier trend, which is easier to model than the untransformed data.
- (b) A scatterplot of the same data but where each variable has been log-transformed.
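The claim that a transformation can make strongly skewed data much less skewed can be checked numerically. The sketch below uses made-up values (each three times the previous one) and the moment-based skewness $m_3 / m_2^{3/2}$; both the data and the helper name `skewness` are assumptions for illustration, not taken from the text.

```python
import math

def skewness(xs):
    """Moment-based sample skewness: m3 / m2**1.5."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5

# Hypothetical, strongly right-skewed values (each is 3x the previous)
raw = [1, 3, 9, 27, 81, 243, 729]
logged = [math.log(x) for x in raw]

print(skewness(raw))     # clearly positive
print(skewness(logged))  # essentially 0: the logs are evenly spaced
```

Real data will rarely become perfectly symmetric, but the direction of the effect is the same: the long right tail is pulled in and extreme values become less extreme.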
-
Log Transformations
- State how a log transformation can help make a relationship clear.
- The log transformation can be used to make highly skewed distributions less skewed.
- The comparison of the means of log-transformed data is actually a comparison of geometric means.
- Therefore, if the arithmetic means of two sets of log-transformed data are equal then the geometric means are equal.
- Scatter plots of brain weight as a function of body weight in terms of both raw data (upper panel) and log-transformed data (lower panel).
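The link between log-transformed means and geometric means can be made concrete: exponentiating the arithmetic mean of the logs gives the geometric mean. The following is a minimal sketch with hypothetical data sets `a` and `b`, chosen so that their log means coincide.

```python
import math

def geometric_mean(xs):
    """Back-transform the arithmetic mean of the natural logs."""
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

def mean_log10(xs):
    """Arithmetic mean of the base-10 logs."""
    return sum(math.log10(x) for x in xs) / len(xs)

# Two hypothetical data sets with different raw means
a = [1, 10, 100]
b = [5, 10, 20]

print(mean_log10(a), mean_log10(b))          # both approximately 1
print(geometric_mean(a), geometric_mean(b))  # both approximately 10
print(sum(a) / len(a), sum(b) / len(b))      # arithmetic means differ: 37.0 vs 11.67
```

The base of the logarithm does not matter as long as the back-transformation uses the same base.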
-
Box-Cox Transformations
- Data that are normal lead to a straight line on the q-q plot.
- Such data are often strongly skewed, as is clear from Figure 3.
- The kernel density plot of the optimally transformed data is shown in the left frame of Figure 4.
- (L) Density plot of the 1973 British income data.
- (L) Density plot of the 1973 British income data transformed with λ = 0.21.
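For reference, the standard Box-Cox transformation is $(x^{\lambda} - 1)/\lambda$ for $\lambda \neq 0$ and $\log x$ for $\lambda = 0$, defined for positive $x$. The sketch below applies it with the λ = 0.21 mentioned above; the income values are hypothetical placeholders, not the 1973 British income data, and no attempt is made to estimate the optimal λ.

```python
import math

def box_cox(x, lam):
    """Box-Cox transformation of a single positive value."""
    if x <= 0:
        raise ValueError("Box-Cox requires positive data")
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1) / lam

# Hypothetical right-skewed incomes (not the data from the figures)
incomes = [900, 1500, 2200, 4000, 12000]
print([round(box_cox(x, 0.21), 2) for x in incomes])

# As lam approaches 0, the transformation approaches the natural log
print(box_cox(5.0, 1e-6), math.log(5.0))
```

Subtracting 1 and dividing by λ is what makes the family continuous at λ = 0, so the log transformation fits in as a limiting case rather than a special rule.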
-
Linear Transformations
- Often it is necessary to transform data from one measurement scale to another.
- To transform feet to inches, you simply multiply by 12.
- Similarly, to transform inches to feet, you divide by 12.
- The transformation consists of multiplying by a constant and then adding a second constant.
- Such transformations are therefore called linear transformations.
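A linear transformation as described here has the form $y = bx + a$. The sketch below uses a hypothetical helper `linear_transform`. The feet-to-inches case comes from the text; the Celsius-to-Fahrenheit case is added only as an example in which the second constant is not zero.

```python
def linear_transform(xs, multiplier, offset=0.0):
    """Multiply each value by a constant, then add a second constant."""
    return [multiplier * x + offset for x in xs]

feet = [5.0, 5.5, 6.0]
inches = linear_transform(feet, 12)            # offset is 0 here
print(inches)                                  # [60.0, 66.0, 72.0]

celsius = [0.0, 100.0]
fahrenheit = linear_transform(celsius, 1.8, 32)
print(fahrenheit)                              # [32.0, 212.0]
```

Because the same two constants are applied to every point, the mean is transformed in the same way as the individual values, and the shape of the distribution is unchanged.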
-
Tukey Ladder of Powers
- Plotting the data on a scatter diagram is the first step.
- These data are plotted two ways in Figure 1.
- The right frame displays the transformed data, together with the linear fit for the 1790-1960 period.
- The demonstration in Figure 7 shows distributions of the data from the Stereograms case study as transformed with various values of λ.
- Keep in mind that λ = 1 is the raw data.
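A common way to write the ladder, assumed here, is $x^{\lambda}$ for $\lambda > 0$, $\log x$ for $\lambda = 0$, and $-(x^{\lambda})$ for $\lambda < 0$, where the minus sign keeps the ordering of the data intact. The function name `tukey_ladder` and the sample values are hypothetical.

```python
import math

def tukey_ladder(xs, lam):
    """Apply one rung of the ladder of powers to positive data."""
    if any(x <= 0 for x in xs):
        raise ValueError("this sketch assumes strictly positive data")
    if lam > 0:
        return [x ** lam for x in xs]
    if lam == 0:
        return [math.log(x) for x in xs]
    # Negative powers reverse the order, so negate to restore it
    return [-(x ** lam) for x in xs]

data = [1, 4, 16]                 # hypothetical values
print(tukey_ladder(data, 1))      # [1, 4, 16]: lambda = 1 is the raw data
print(tukey_ladder(data, 0.5))    # [1.0, 2.0, 4.0]
print(tukey_ladder(data, -1))     # [-1.0, -0.25, -0.0625]
```

Moving down the ladder (smaller λ) compresses large values more strongly, which is why lower rungs are tried when data are right-skewed.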
-
Exploratory Data Analysis (EDA)
- Exploratory data analysis (EDA) is an approach to analyzing data sets in order to summarize their main characteristics, often with visual methods.
- EDA is different from initial data analysis (IDA), which focuses more narrowly on checking assumptions required for model fitting and hypothesis testing, handling missing values, and making transformations of variables as needed.
- Exploratory data analysis was promoted by John Tukey to encourage statisticians to explore the data and possibly formulate hypotheses that could lead to new data collection and experiments.
- Tukey promoted the use of the five number summary of numerical data:
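The five-number summary consists of the minimum, lower quartile, median, upper quartile, and maximum. A minimal sketch using Python's standard library follows. Quartile conventions differ between textbooks and software, so the `method="inclusive"` choice and the sample data are assumptions.

```python
import statistics

def five_number_summary(xs):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    q1, q2, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    return (min(xs), q1, q2, q3, max(xs))

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]   # hypothetical values
print(five_number_summary(data))     # (1, 3.0, 5.0, 7.0, 9)
```

These five values are the ones drawn in a box plot, which makes the summary a natural starting point for visual exploration.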
-
Email data
- The email data set was first presented in Chapter 1 with a relatively small number of variables. In fact, there are many more variables available that might be useful for classifying spam. Descriptions of these variables are presented in Table 8.13. The spam variable will be the outcome, and the other 10 variables will be the model predictors. While we have limited the predictors used in this section to be categorical variables (where many are represented as indicator variables), numerical predictors may also be used in logistic regression.
- We could resolve this issue by transforming these variables (e.g. using a log-transformation), but we will omit this further investigation for brevity.
-
References
- Box, G. E. P., & Cox, D. R. (1964). An analysis of transformations. Journal of the Royal Statistical Society, Series B, 26, 211-252.