http://www.comp.lancs.ac.uk/~kristof/research/notes/basicstats/index.html
Basic Statistics and Metrics for Sensor Analysis
Notations
Say we have a system where, at each point in time t, we obtain d values from d different sensors. We call those d values features, and define the vector containing the d features as a sample vector or feature vector x(t). Over a given period of time we obtain n of these sample vectors and can collect them as the columns of one d × n matrix, which we call the dataset D. A row vector of D is called a time vector (one feature observed over time), while a column vector of D is a sample vector, as defined above. A timeseries plot draws each feature as a function of time and illustrates how the sensors behave over time. A cluster plot draws the sample vectors, or sub-vectors of the sample vectors, in their vector space. No a priori knowledge about the distribution of the observed data is assumed.
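As a rough illustration of this notation, the sketch below builds a small dataset D from made-up sensor readings and extracts a time vector and a sample vector. The readings and the helper names `build_dataset`, `time_vector` and `sample_vector` are assumptions made for this example only.

```python
# Hypothetical readings: d = 3 sensors sampled at n = 4 points in time.
samples = [
    [20.1, 0.52, 101.3],  # x(1)
    [20.4, 0.50, 101.2],  # x(2)
    [20.9, 0.47, 101.2],  # x(3)
    [21.3, 0.45, 101.1],  # x(4)
]


def build_dataset(sample_vectors):
    """Arrange the sample vectors as the columns of a d x n matrix.

    The matrix is stored as a list of d rows, each holding n values.
    """
    d = len(sample_vectors[0])
    if any(len(x) != d for x in sample_vectors):
        raise ValueError("all sample vectors must have the same dimension d")
    return [[x[i] for x in sample_vectors] for i in range(d)]


def time_vector(dataset, i):
    """Row i of D: feature i observed over time."""
    return list(dataset[i])


def sample_vector(dataset, t):
    """Column t of D: all d features at time index t (0-based)."""
    return [row[t] for row in dataset]


D = build_dataset(samples)
print(time_vector(D, 0))    # first sensor over time
print(sample_vector(D, 2))  # all sensors at the third point in time
```

Storing D as a list of rows keeps each time vector directly accessible, which is convenient for the per-feature statistics that follow.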
Example of a timeseries plot over 3000 samples (approx. 48 hours)
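The mean of a feature i over n d-dimensional samples is defined as:

$$\mu_i = \frac{1}{n} \sum_{t=1}^{n} x_i(t)$$

where $x_i(t)$ is the i-th component of the sample vector x(t). This is usually referred to as the arithmetic mean. Other statistics that classify as means are the arithmetic-geometric mean, the geometric mean, the harmonic mean, and the quadratic mean (also known as the root-mean-square).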
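A minimal sketch of the arithmetic mean and three of the other means named above, in plain Python. The function names are chosen for this example; the geometric and harmonic means are only meaningful here for strictly positive values, which the sketch assumes rather than checks.

```python
import math


def arithmetic_mean(values):
    """Sum of the values divided by their count."""
    return sum(values) / len(values)


def geometric_mean(values):
    """n-th root of the product, computed via logarithms (positive values only)."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def harmonic_mean(values):
    """Reciprocal of the arithmetic mean of the reciprocals (non-zero values only)."""
    return len(values) / sum(1.0 / v for v in values)


def quadratic_mean(values):
    """Root-mean-square: square root of the mean of the squared values."""
    return math.sqrt(sum(v * v for v in values) / len(values))


feature = [1.0, 2.0, 4.0]  # hypothetical time vector of one feature
for mean in (harmonic_mean, geometric_mean, arithmetic_mean, quadratic_mean):
    print(mean.__name__, mean(feature))
```

For positive data these four means are ordered harmonic ≤ geometric ≤ arithmetic ≤ quadratic, which the printed values reflect.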
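The variance of a feature i over n d-dimensional samples is defined as:

$$\sigma_i^2 = \frac{1}{n} \sum_{t=1}^{n} \left( x_i(t) - \mu_i \right)^2$$

where $\mu_i$ is the mean of feature i. The square root of the variance is called the standard deviation:

$$\sigma_i = \sqrt{\sigma_i^2}$$

It measures how far the values of the feature typically spread around their mean.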
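The variance and standard deviation of one feature can be sketched as below, assuming the division by n used in these notes. The function names and the sample values are illustrative only.

```python
import math


def variance(values):
    """Mean squared deviation from the arithmetic mean (divides by n)."""
    n = len(values)
    mu = sum(values) / n
    return sum((v - mu) ** 2 for v in values) / n


def standard_deviation(values):
    """Square root of the variance, in the same unit as the values."""
    return math.sqrt(variance(values))


feature = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # hypothetical readings
print(variance(feature), standard_deviation(feature))
```

Because the standard deviation is expressed in the unit of the sensor itself, it is usually the easier of the two to interpret on a timeseries plot.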
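The covariance of two features i and j from n d-dimensional samples in a dataset D is defined as:

$$\mathrm{cov}(i,j) = \frac{1}{n} \sum_{t=1}^{n} \left( x_i(t) - \mu_i \right) \left( x_j(t) - \mu_j \right)$$

where $\mu_i$ and $\mu_j$ are the means of features i and j. Several properties of the covariance are:
the covariance of a feature with itself is its variance, cov(i,i) = $\sigma_i^2$
the covariance is symmetric, cov(i,j) = cov(j,i)
sometimes the denominator is defined as n-1 instead of n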
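A small sketch of the covariance between two time vectors follows. The `ddof` parameter is an assumption of this example (the name is borrowed from common numerical libraries): `ddof=0` divides by n, and `ddof=1` gives the n-1 variant mentioned above.

```python
def covariance(a, b, ddof=0):
    """Covariance of two equally long time vectors.

    ddof=0 divides by n; ddof=1 divides by n-1.
    """
    if len(a) != len(b):
        raise ValueError("both time vectors must have the same length")
    n = len(a)
    mu_a = sum(a) / n
    mu_b = sum(b) / n
    total = sum((p - mu_a) * (q - mu_b) for p, q in zip(a, b))
    return total / (n - ddof)


# Hypothetical time vectors of two features that rise together.
feature_i = [1.0, 2.0, 3.0, 4.0]
feature_j = [2.0, 4.0, 6.0, 8.0]
print(covariance(feature_i, feature_j))          # divides by n
print(covariance(feature_i, feature_j, ddof=1))  # divides by n-1
```

A positive result means the two features tend to move in the same direction; a negative one means they tend to move in opposite directions.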
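All possible covariances can be put together in a d × d covariance matrix Cov(D) for dataset D:

$$\mathrm{Cov}(D) = \begin{pmatrix} \mathrm{cov}(1,1) & \cdots & \mathrm{cov}(1,d) \\ \vdots & \ddots & \vdots \\ \mathrm{cov}(d,1) & \cdots & \mathrm{cov}(d,d) \end{pmatrix}$$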
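The covariance matrix can be sketched in plain Python as below, assuming the dataset is stored as a list of d time vectors (rows) of n values each and that covariances divide by n. The function name `covariance_matrix` is chosen for this example.

```python
def covariance_matrix(dataset):
    """Return the d x d matrix of all pairwise feature covariances.

    dataset: list of d time vectors, each holding n values.
    """
    d = len(dataset)
    n = len(dataset[0])
    if any(len(row) != n for row in dataset):
        raise ValueError("all time vectors must have the same length n")
    means = [sum(row) / n for row in dataset]
    # Subtract each feature's mean once, then reuse the centered rows.
    centered = [[v - m for v in row] for row, m in zip(dataset, means)]
    return [
        [sum(p * q for p, q in zip(centered[i], centered[j])) / n
         for j in range(d)]
        for i in range(d)
    ]


# Hypothetical dataset with d = 2 features and n = 4 samples.
D = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],
]
for row in covariance_matrix(D):
    print(row)
```

The diagonal holds the variance of each feature, and the matrix equals its own transpose.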
Let's assume that x and y are two d-dimensional vectors. The aim is to calculate how close these vectors are to each other. For this, there exist several common techniques.

The Manhattan, Cityblock or L1 metric is defined as:

$$d_1(x,y) = \sum_{i=1}^{d} |x_i - y_i|$$

The Euclidean or L2 metric is defined as:

$$d_2(x,y) = \sqrt{\sum_{i=1}^{d} (x_i - y_i)^2}$$

The L∞, Chebychev or Maximum metric is defined as:

$$d_\infty(x,y) = \max_{1 \le i \le d} |x_i - y_i|$$

The Minkowski or Lλ metric is defined as:

$$d_\lambda(x,y) = \left( \sum_{i=1}^{d} |x_i - y_i|^\lambda \right)^{1/\lambda}$$

Notice that λ=1 gives the cityblock distance and λ=2 gives the Euclidean distance, while the limit λ→∞ gives the Chebychev distance.

The Mahalanobis metric is defined as:

$$d_M(x,y) = \sqrt{(x - y)^T \, \mathrm{Cov}(D)^{-1} \, (x - y)}$$

Note that if the covariance matrix Cov(D) is the identity matrix, the Mahalanobis distance equals the Euclidean distance.
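The Minkowski family can be sketched with a single function, with the Manhattan, Euclidean and Chebychev metrics as special cases. The function names and the use of `math.inf` to request the Chebychev limit are conventions of this example, not part of the original notes.

```python
import math


def minkowski(x, y, lam):
    """Minkowski distance of order lam between two equally long vectors.

    lam must be >= 1; pass math.inf for the Chebychev (maximum) metric.
    """
    if len(x) != len(y):
        raise ValueError("both vectors must have the same dimension")
    if lam < 1:
        raise ValueError("lam must be at least 1")
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if math.isinf(lam):
        return max(diffs)
    return sum(diff ** lam for diff in diffs) ** (1.0 / lam)


def manhattan(x, y):
    return minkowski(x, y, 1)


def euclidean(x, y):
    return minkowski(x, y, 2)


def chebychev(x, y):
    return minkowski(x, y, math.inf)


x, y = [0.0, 0.0], [3.0, 4.0]
print(manhattan(x, y), euclidean(x, y), chebychev(x, y))
print(minkowski(x, y, 50))  # already close to the Chebychev distance
```

Raising λ gives progressively more weight to the largest coordinate difference, which is why large orders approach the maximum metric.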
The different contours for constant Manhattan, Euclidean, L∞ and Mahalanobis metrics (given dependent dimensions) in 2D space.
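The Mahalanobis metric is the only one of these that needs the covariance matrix. The sketch below avoids forming the inverse explicitly and instead solves Cov(D) z = (x - y) by Gaussian elimination; that choice, the helper `solve_linear`, and the example matrices are assumptions of this illustration.

```python
import math


def solve_linear(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [bi] for row, bi in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (m[r][n] - s) / m[r][r]
    return z


def mahalanobis(x, y, cov):
    """Square root of (x - y)^T cov^{-1} (x - y)."""
    diff = [a - b for a, b in zip(x, y)]
    z = solve_linear(cov, diff)  # z = cov^{-1} (x - y)
    return math.sqrt(sum(d * zi for d, zi in zip(diff, z)))


identity = [[1.0, 0.0], [0.0, 1.0]]
dependent = [[2.0, 1.0], [1.0, 2.0]]  # hypothetical positively correlated features
print(mahalanobis([3.0, 4.0], [0.0, 0.0], identity))    # same as Euclidean
print(mahalanobis([1.0, 1.0], [0.0, 0.0], dependent))   # along the correlation
print(mahalanobis([1.0, -1.0], [0.0, 0.0], dependent))  # against the correlation
```

With dependent dimensions, two points at the same Euclidean distance from the origin get different Mahalanobis distances, which is what stretches the constant-distance contour into a tilted ellipse.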
The Canberra metric is defined as:

$$d_C(x,y) = \sum_{i=1}^{d} \frac{|x_i - y_i|}{|x_i| + |y_i|}$$

The output ranges from 0 to the number of variables used. The Canberra distance is very sensitive to small changes near zero.

The Correlation metric is defined as:

$$r(x,y) = \frac{\sum_{i=1}^{d} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{d} (x_i - \bar{x})^2} \, \sqrt{\sum_{i=1}^{d} (y_i - \bar{y})^2}}$$

where $\bar{x}$ and $\bar{y}$ are the means of the components of x and y. The correlation similarity measure is the covariance divided by the product of the standard deviations, and takes values between -1 and 1. With this measure, the relative direction of the two observation vectors is important: it gives the cosine of the angle between the two observation vectors measured from the mean. This measure makes more sense for time vectors than for sample vectors.

The Auto Correlation metric for a series x(1), ..., x(N) with mean μ and lag k is defined as:

$$r_k = \frac{\sum_{t=1}^{N-k} (x(t) - \mu)(x(t+k) - \mu)}{\sum_{t=1}^{N} (x(t) - \mu)^2}$$

It is just the correlation between a series and a lagged version of itself, or the covariance at lag k divided by the variance. A high autocorrelation at lag k is likely to indicate a periodicity in the signal with a period of k samples. If the autocorrelation coefficient is calculated for all lags k=0,1,2...N-1, the resulting series is called the autocorrelation series or the correlogram.

The Angular metric is defined as:

$$\cos\theta = \frac{\sum_{i=1}^{d} x_i y_i}{\sqrt{\sum_{i=1}^{d} x_i^2} \, \sqrt{\sum_{i=1}^{d} y_i^2}}$$

which is the cosine of the angle between the two observation vectors measured from zero, and takes values from -1 to 1.

A neighbourhood of a vector x under a given metric is found by fixing a distance value r and varying the second vector y in the distance formula, keeping every y for which the distance to x is smaller than or equal to r.
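A sketch of the Canberra, correlation and angular measures in plain Python follows. It assumes the Canberra form with |x_i| + |y_i| in the denominator (which matches the stated range of 0 to the number of variables) and, as a further assumption of this example, lets a coordinate where both values are zero contribute nothing.

```python
import math


def canberra(x, y):
    """Sum of |x_i - y_i| / (|x_i| + |y_i|) over all coordinates."""
    total = 0.0
    for a, b in zip(x, y):
        denom = abs(a) + abs(b)
        if denom > 0:  # assumed convention: a 0/0 term counts as 0
            total += abs(a - b) / denom
    return total


def angular(x, y):
    """Cosine of the angle between x and y, measured from zero."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def correlation(x, y):
    """Cosine of the angle between x and y, measured from their means."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return angular([a - mean_x for a in x], [b - mean_y for b in y])


# The same absolute change weighs far more near zero than far from it.
print(canberra([0.001], [0.002]), canberra([100.001], [100.002]))

# An offset changes the angular measure but not the correlation.
print(angular([1.0, 2.0, 3.0], [11.0, 12.0, 13.0]))
print(correlation([1.0, 2.0, 3.0], [11.0, 12.0, 13.0]))
```

Writing the correlation as the angular measure of mean-centered vectors makes the relationship between the two explicit: they differ only in the reference point from which the angle is measured.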
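The autocorrelation coefficient and the correlogram can be sketched as follows. The sketch assumes the common form in which the lag-k covariance sum runs over the N-k overlapping samples and is divided by the full sum of squared deviations; the function names and the test signal are made up for this example.

```python
def autocorrelation(series, k):
    """Autocorrelation coefficient of a series at lag k (0 <= k < N)."""
    n = len(series)
    if not 0 <= k < n:
        raise ValueError("lag k must satisfy 0 <= k < N")
    mu = sum(series) / n
    var_sum = sum((v - mu) ** 2 for v in series)
    if var_sum == 0:
        raise ValueError("autocorrelation is undefined for a constant series")
    cov_sum = sum((series[t] - mu) * (series[t + k] - mu) for t in range(n - k))
    return cov_sum / var_sum


def correlogram(series):
    """Autocorrelation coefficients for all lags k = 0, 1, ..., N-1."""
    return [autocorrelation(series, k) for k in range(len(series))]


# Hypothetical signal that repeats every 4 samples.
signal = [0.0, 1.0, 0.0, -1.0] * 5
coefficients = correlogram(signal)
print(coefficients[:9])
```

For this signal the coefficient peaks again at lag 4, the period of the pattern, and is strongly negative at lag 2, where the signal is in anti-phase with itself.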
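The notion of a neighbourhood can be illustrated by filtering a finite set of candidate vectors, which is an assumption of this sketch: the definition itself ranges over all vectors in the space. The `neighbourhood` helper and its `metric` parameter are hypothetical names for this example.

```python
import math


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))


def chebychev(x, y):
    return max(abs(a - b) for a, b in zip(x, y))


def neighbourhood(center, candidates, radius, metric=euclidean):
    """Return the candidates whose distance to center is at most radius."""
    return [v for v in candidates if metric(center, v) <= radius]


center = [0.0, 0.0]
candidates = [[1.0, 0.0], [3.0, 4.0], [0.5, 0.5], [2.0, 2.0]]
print(neighbourhood(center, candidates, 1.0))
print(neighbourhood(center, candidates, 2.0, metric=chebychev))
```

Swapping the metric changes the shape of the neighbourhood: a disc for the Euclidean metric, a diamond for the Manhattan metric, and a square for the Chebychev metric.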
Compiled by Kristof Van Laerhoven.
Thanks to Sergio R. Caprile for his comments.