🔗 Jaccard Index

🔗 Computer science 🔗 Statistics

The Jaccard index, also known as the Jaccard similarity coefficient, is a statistic used for gauging the similarity and diversity of sample sets. It was developed by Grove Karl Gilbert in 1884 as his ratio of verification (v) and now is frequently referred to as the Critical Success Index in meteorology. It was later developed independently by Paul Jaccard, originally giving the French name coefficient de communauté, and independently formulated again by T. Tanimoto. Thus, the Tanimoto index or Tanimoto coefficient are also used in some fields. However, they are identical in generally taking the ratio of Intersection over Union. The Jaccard coefficient measures similarity between finite sample sets, and is defined as the size of the intersection divided by the size of the union of the sample sets:

J ( A , B ) = | A ∩ B | | A ∪ B | = | A ∩ B | | A | + | B | − | A ∩ B | . {\displaystyle J(A,B)={{|A\cap B|} \over {|A\cup B|}}={{|A\cap B|} \over {|A|+|B|-|A\cap B|}}.}

Note that by design, 0 ≤ J ( A , B ) ≤ 1. {\displaystyle 0\leq J(A,B)\leq 1.} If A intersection B is empty, then J(A,B) = 0. The Jaccard coefficient is widely used in computer science, ecology, genomics, and other sciences, where binary or binarized data are used. Both the exact solution and approximation methods are available for hypothesis testing with the Jaccard coefficient.

Jaccard similarity also applies to bags, i.e., Multisets. This has a similar formula, but the symbols mean bag intersection and bag sum (not union). The maximum value is 1/2.

J ( A , B ) = | A ∩ B | | A ⊎ B | = | A ∩ B | | A | + | B | . {\displaystyle J(A,B)={{|A\cap B|} \over {|A\uplus B|}}={{|A\cap B|} \over {|A|+|B|}}.}

The Jaccard distance, which measures dissimilarity between sample sets, is complementary to the Jaccard coefficient and is obtained by subtracting the Jaccard coefficient from 1, or, equivalently, by dividing the difference of the sizes of the union and the intersection of two sets by the size of the union:

d J ( A , B ) = 1 − J ( A , B ) = | A ∪ B | − | A ∩ B | | A ∪ B | . {\displaystyle d_{J}(A,B)=1-J(A,B)={{|A\cup B|-|A\cap B|} \over |A\cup B|}.}

An alternative interpretation of the Jaccard distance is as the ratio of the size of the symmetric difference A △ B = ( A ∪ B ) − ( A ∩ B ) {\displaystyle A\triangle B=(A\cup B)-(A\cap B)} to the union. Jaccard distance is commonly used to calculate an n × n matrix for clustering and multidimensional scaling of n sample sets.

This distance is a metric on the collection of all finite sets.

There is also a version of the Jaccard distance for measures, including probability measures. If μ {\displaystyle \mu } is a measure on a measurable space X {\displaystyle X} , then we define the Jaccard coefficient by

J μ ( A , B ) = μ ( A ∩ B ) μ ( A ∪ B ) , {\displaystyle J_{\mu }(A,B)={{\mu (A\cap B)} \over {\mu (A\cup B)}},}

and the Jaccard distance by

d μ ( A , B ) = 1 − J μ ( A , B ) = μ ( A △ B ) μ ( A ∪ B ) . {\displaystyle d_{\mu }(A,B)=1-J_{\mu }(A,B)={{\mu (A\triangle B)} \over {\mu (A\cup B)}}.}

Care must be taken if μ ( A ∪ B ) = 0 {\displaystyle \mu (A\cup B)=0} or ∞ {\displaystyle \infty } , since these formulas are not well defined in these cases.

The MinHash min-wise independent permutations locality sensitive hashing scheme may be used to efficiently compute an accurate estimate of the Jaccard similarity coefficient of pairs of sets, where each set is represented by a constant-sized signature derived from the minimum values of a hash function.

Discussed on