Proximity measures

December 24, 2024

Proximity measures in pattern recognition quantify the similarity or dissimilarity between patterns, data points, or feature vectors. They play a central role in clustering, classification, and other machine learning tasks. Proximity is typically expressed either as a distance (dissimilarity), where smaller values mean closer points, or as a similarity, where larger values mean closer points.

Common Proximity Measures:

Distance Measures (Dissimilarity):

Euclidean Distance:
Formula: $d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
Interpretation: The straight-line distance between two points in Euclidean space.
Use: Works well for continuous data and in tasks like k-means clustering.
Manhattan Distance (City Block):
Formula: $d(x, y) = \sum_{i=1}^{n} |x_i - y_i|$
Interpretation: The sum of the absolute differences along each dimension.
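Use: Often preferred over Euclidean distance for high-dimensional data, because it does not square the per-dimension differences and is therefore less dominated by a single large difference.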
Minkowski Distance:
Generalization of Euclidean and Manhattan distances.
Formula: $d(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$
Special cases:
$p=1$: Manhattan distance.
$p=2$: Euclidean distance.
Use: Flexible for different applications by varying $p$.
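
The relationship between the three formulas above can be checked with a few lines of Python. This is a minimal sketch, not a reference implementation: the function name `minkowski_distance` and the sample points are chosen here for illustration, and the check `p >= 1` reflects the usual requirement for the result to satisfy the triangle inequality.

```python
def minkowski_distance(x, y, p=2.0):
    """Minkowski distance of order p between two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    if p < 1:
        raise ValueError("this sketch only supports p >= 1")
    # Sum |x_i - y_i|^p over all dimensions, then take the p-th root.
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


# Illustrative points (assumed for this example).
x = [0.0, 0.0]
y = [3.0, 4.0]

manhattan = minkowski_distance(x, y, p=1)  # |3| + |4|
euclidean = minkowski_distance(x, y, p=2)  # sqrt(3^2 + 4^2)
```

With these points the Manhattan distance is 7 and the Euclidean distance is 5, so the same function reproduces both special cases by changing only $p$.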
Cosine Distance:
Formula (Cosine Similarity): $\text{sim}(x, y) = \frac{x \cdot y}{\|x\| \|y\|}$
Distance: $d(x, y) = 1 - \text{sim}(x, y)$
Interpretation: Depends only on the angle between the two vectors, not on their magnitudes; the similarity is the cosine of that angle.
Use: Common in text and document similarity tasks.
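
A small sketch of the cosine formulas above, using only the standard library. The function names and the toy term-count vectors (`doc_a`, `doc_b`, `doc_c`) are assumptions made for illustration. The zero-vector check is added because the formula divides by the vector norms.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_x * norm_y)


def cosine_distance(x, y):
    """Cosine distance, defined as 1 - cosine similarity."""
    return 1.0 - cosine_similarity(x, y)


# Toy term-count vectors (assumed for this example).
doc_a = [1, 2, 0]
doc_b = [2, 4, 0]  # same direction as doc_a, twice the length
doc_c = [0, 0, 3]  # shares no terms with doc_a

same_direction = cosine_distance(doc_a, doc_b)  # close to 0
orthogonal = cosine_distance(doc_a, doc_c)      # 1
```

Because `doc_b` is a scaled copy of `doc_a`, their cosine distance is zero up to floating-point rounding, which is why this measure suits documents of different lengths.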
Mahalanobis Distance:
Formula: $d(x, y) = \sqrt{(x - y)^T S^{-1} (x - y)}$
$S$: Covariance matrix of the data.
Interpretation: Rescales each difference by the variance of the data and takes correlations between variables into account.
Use: Effective for data with correlated features.
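
To make the role of $S$ concrete, the sketch below evaluates the formula for two-dimensional points, inverting the 2x2 covariance matrix by hand so that no external library is needed. The function name `mahalanobis_2d` and both covariance matrices are assumptions chosen for illustration; in practice $S$ would be estimated from the data.

```python
import math


def mahalanobis_2d(x, y, cov):
    """Mahalanobis distance between 2-D points, given a 2x2 covariance matrix."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    # Inverse of a 2x2 matrix: (1 / det) * [[d, -b], [-c, a]]
    inv = ((d / det, -b / det), (-c / det, a / det))
    dx, dy = x[0] - y[0], x[1] - y[1]
    # Quadratic form (x - y)^T S^{-1} (x - y)
    q = (dx * (inv[0][0] * dx + inv[0][1] * dy)
         + dy * (inv[1][0] * dx + inv[1][1] * dy))
    return math.sqrt(q)


# Assumed covariance matrices for this example.
identity = ((1.0, 0.0), (0.0, 1.0))   # unit variance, no correlation
stretched = ((4.0, 0.0), (0.0, 1.0))  # first feature has variance 4

d_identity = mahalanobis_2d((0.0, 0.0), (2.0, 0.0), identity)
d_stretched = mahalanobis_2d((0.0, 0.0), (2.0, 0.0), stretched)
```

With the identity matrix the result equals the Euclidean distance (2). With variance 4 along the first feature, the same offset counts as only one standard deviation, so the distance drops to 1.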
Hamming Distance:
Formula: $d(x, y) = \sum_{i=1}^{n} \mathbb{1}(x_i \neq y_i)$
Interpretation: Counts the positions at which two equal-length sequences differ (bits or characters).
Use: Suitable for categorical or binary data.
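
The indicator sum above translates directly into a position-by-position comparison. In this minimal sketch, the function name and the sample strings are assumptions chosen for illustration; the length check reflects that the formula sums over a shared index range.

```python
def hamming_distance(x, y):
    """Number of positions at which two equal-length sequences differ."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    # Each mismatching position contributes 1, matching the indicator sum.
    return sum(1 for a, b in zip(x, y) if a != b)


# Sample inputs (assumed for this example).
bit_distance = hamming_distance("10110", "10011")
word_distance = hamming_distance("karolin", "kathrin")
```

The two bit strings differ in two positions and the two words in three, so the same function covers both binary and categorical sequences.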
