Mahalanobis distance is preserved under full-rank linear transformations of the space spanned by the data. This means that if the data lie in a proper subspace (that is, the covariance matrix has a nontrivial nullspace), you can compute the Mahalanobis distance after projecting the data, without degeneracy, onto a subspace of the appropriate dimension. Useful decompositions of the squared Mahalanobis distance can help explain why a multivariate observation is outlying and also provide a graphical tool for identifying outliers. The first step is to find the centroid, or center of mass, of the sample points.
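As a concrete illustration of that first step, the centroid of a sample is simply the vector of column means of the data matrix. A minimal sketch in Python (the data values here are hypothetical):

```python
import numpy as np

# Hypothetical sample of five bivariate observations (rows are points).
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 5.0]])

# The centroid (center of mass) is the vector of column means.
centroid = X.mean(axis=0)
print(centroid)  # [3. 3.]
```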
I previously described how to use Mahalanobis distance to find outliers in multivariate data. This article takes a closer look at Mahalanobis distance.
A subsequent article will describe how you can compute Mahalanobis distance.

Distance in standard units

In statistics, we sometimes measure "nearness" or "farness" in terms of the scale of the data. Often "scale" means "standard deviation." You can also describe the distance between two observations by stating how many standard deviations apart they are.
For many distributions, such as the normal distribution, this choice of scale also makes a statement about probability. Specifically, it is more likely to observe a value that is about one standard deviation from the mean than one that is several standard deviations away, because the probability density function is high near the mean and nearly zero far from it.
For normally distributed data, you can specify the distance from the mean by computing the so-called z-score, z = (x − μ)/σ. This is a dimensionless quantity that you can interpret as the number of standard deviations that x is from the mean.

Distance is not always what it seems

You can generalize these ideas to the multivariate normal distribution. The following graph shows simulated bivariate normal data overlaid with prediction ellipses.
The prediction ellipses are contours of the bivariate normal density function. In the graph, two observations are displayed by using red stars as markers. The first observation is at the coordinates (4,0), whereas the second is at (0,2). The question is: which marker is closer to the origin? (The origin is the multivariate center of this distribution.) The answer is, "It depends on how you measure distance." In Euclidean distance, the point (0,2) is closer to the origin. However, for this distribution, the variance in the Y direction is less than the variance in the X direction, so in some sense the point (0,2) is "more standard deviations" away from the origin than (4,0) is.
Notice the position of the two observations relative to the ellipses. The point (0,2) lies on a larger, less probable contour than the point (4,0). What does this mean? It means that the point at (4,0) is "closer" to the origin in the sense that you are more likely to observe an observation near (4,0) than near (0,2): the probability density is higher near (4,0) than it is near (0,2). In this sense, prediction ellipses are a multivariate generalization of "units of standard deviation": a point p is closer than a point q if the contour that contains p is nested within the contour that contains q.
Defining the Mahalanobis distance

You can use the probability contours to define the Mahalanobis distance. The Mahalanobis distance has the following properties:

- It accounts for the fact that the variances in each direction are different.
- It accounts for the covariance between variables.
- It reduces to the familiar Euclidean distance for uncorrelated variables with unit variance.

For univariate normal data, the univariate z-score standardizes the distribution so that it has mean 0 and unit variance, and it gives a dimensionless quantity that specifies the distance from an observation to the mean in terms of the scale of the data.
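To make this concrete, here is a minimal sketch that checks the earlier claim about (4,0) and (0,2). The covariance matrix below is an illustrative assumption, chosen to have more variance in X than in Y (like the distribution in the graph); it is not the article's simulated data:

```python
import numpy as np

def mahalanobis(x, mu, cov):
    """Mahalanobis distance from x to center mu: sqrt((x-mu)^T cov^{-1} (x-mu))."""
    d = x - mu
    return np.sqrt(d @ np.linalg.solve(cov, d))

# Assumed covariance: uncorrelated, variance 9 in X and 1 in Y.
mu = np.array([0.0, 0.0])
cov = np.array([[9.0, 0.0],
                [0.0, 1.0]])

# (4,0) is farther in Euclidean distance but closer in Mahalanobis distance.
d1 = mahalanobis(np.array([4.0, 0.0]), mu, cov)  # 4/3
d2 = mahalanobis(np.array([0.0, 2.0]), mu, cov)  # 2.0
print(d1 < d2)  # True
```

Note that for this diagonal covariance, each coordinate is simply divided by its standard deviation, so the distance reduces to a Euclidean distance of per-variable z-scores.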
For multivariate data, the analogous transformation uses the covariance matrix Σ: if Σ = LLᵀ is the Cholesky factorization, then z = L⁻¹(x − μ) is standardized, uncorrelated data. After transforming the data, you can compute the standard Euclidean distance from the point z to the origin. This measures how far from the origin a point is, and it is the multivariate generalization of a z-score. You can rewrite zᵀz in terms of the original correlated variables: the squared Mahalanobis distance is (x − μ)ᵀ Σ⁻¹ (x − μ). The Mahalanobis distance accounts for the variance of each variable and the covariance between variables. Geometrically, it does this by transforming the data into standardized uncorrelated data and computing the ordinary Euclidean distance for the transformed data.
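One way to realize this transformation (a choice assumed here; any matrix square root of the covariance also works) is the Cholesky factorization Σ = LLᵀ. The Euclidean norm of the whitened point z = L⁻¹(x − μ) then equals the Mahalanobis distance computed directly from the quadratic form. A sketch with hypothetical numbers:

```python
import numpy as np

# Illustrative center and positive-definite covariance for correlated data.
mu = np.array([1.0, -1.0])
cov = np.array([[4.0, 1.2],
                [1.2, 1.0]])
x = np.array([3.0, 0.5])

# Cholesky factor: cov = L @ L.T
L = np.linalg.cholesky(cov)

# Whitening: z = L^{-1}(x - mu) is standardized and uncorrelated under the
# model, so its ordinary Euclidean norm is the Mahalanobis distance.
z = np.linalg.solve(L, x - mu)
d_euclid_of_z = np.linalg.norm(z)

# Same distance computed directly from the quadratic form.
d_direct = np.sqrt((x - mu) @ np.linalg.solve(cov, x - mu))
print(np.isclose(d_euclid_of_z, d_direct))  # True
```

Using `np.linalg.solve` instead of explicitly inverting the covariance matrix is the numerically preferable way to evaluate the quadratic form.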
In this way, the Mahalanobis distance is like a univariate z-score: it provides a way to measure distances that takes into account the scale of the data.