cosineSimilarity

Document similarities with cosine similarity

Description

similarities = cosineSimilarity(documents) returns the pairwise cosine similarities for the specified documents using the tf-idf matrix derived from their word counts. The score in similarities(i,j) represents the similarity between documents(i) and documents(j).

similarities = cosineSimilarity(documents,queries) returns similarities between documents and queries using tf-idf matrices derived from the word counts in documents. The score in similarities(i,j) represents the similarity between documents(i) and queries(j).

similarities = cosineSimilarity(bag) returns pairwise similarities for the documents encoded by the specified bag-of-words or bag-of-n-grams model using the tf-idf matrix derived from the word counts in bag. The score in similarities(i,j) represents the similarity between the ith and jth documents encoded by bag.

similarities = cosineSimilarity(bag,queries) returns similarities between the documents encoded by the bag-of-words or bag-of-n-grams model bag and queries using tf-idf matrices derived from the word counts in bag. The score in similarities(i,j) represents the similarity between the ith document encoded by bag and queries(j).

similarities = cosineSimilarity(M) returns similarities for the data encoded in the row vectors of the matrix M. The score in similarities(i,j) represents the similarity between M(i,:) and M(j,:).

similarities = cosineSimilarity(M1,M2) returns similarities between the documents encoded in the matrices M1 and M2. The score in similarities(i,j) corresponds to the similarity between M1(i,:) and M2(j,:).
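For the matrix syntaxes, the score is the cosine of the angle between two row vectors. The following is a minimal sketch of that definition in base MATLAB, not the toolbox implementation. The helper name rowCosine and the example matrix are illustrative assumptions.

```matlab
% Illustrative data: each row is one document's feature vector.
M = [1 2 0;
     2 4 0;
     0 0 3];

% Hypothetical helper (not part of the toolbox): cosine similarity
% between every row of A and every row of B, computed as the dot
% product divided by the product of the Euclidean row norms.
rowCosine = @(A,B) (A*B') ./ (vecnorm(A,2,2) * vecnorm(B,2,2)');

S = rowCosine(M,M);   % S(i,j) compares M(i,:) with M(j,:)
disp(S)
```

Rows 1 and 2 point in the same direction, so their score is 1 up to rounding. Row 3 is orthogonal to both, so its scores against them are 0. This sketch divides by zero for an all-zero row and makes no claim about how cosineSimilarity treats that case.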

Examples

Create an array of tokenized documents.

    textData = [
        "the quick brown fox jumped over the lazy dog"
        "the fast brown fox jumped over the lazy dog"
        "the lazy dog sat there and did nothing"
        "the other animals sat there watching"];
    documents = tokenizedDocument(textData)

    documents = 
      4x1 tokenizedDocument:

        9 tokens: the quick brown fox jumped over the lazy dog
        9 tokens: the fast brown fox jumped over the lazy dog
        8 tokens: the lazy dog sat there and did nothing
        6 tokens: the other animals sat there watching

Calculate the similarities between them using the cosineSimilarity function. The output is a sparse matrix.

similarities = cosineSimilarity(documents);

Visualize the similarities between the documents in a heat map.

    figure
    heatmap(similarities);
    xlabel("Document")
    ylabel("Document")
    title("Cosine Similarities")

Figure contains an object of type heatmap. The chart of type heatmap has title Cosine Similarities.

Scores close to one indicate strong similarity. Scores close to zero indicate weak similarity.
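As a complement to the heat map, the scores can be read programmatically. The sketch below finds, for each document, the most similar other document. It repeats the setup so that it runs on its own; the variable names S, bestScore, and bestMatch are illustrative.

```matlab
textData = [
    "the quick brown fox jumped over the lazy dog"
    "the fast brown fox jumped over the lazy dog"
    "the lazy dog sat there and did nothing"
    "the other animals sat there watching"];
documents = tokenizedDocument(textData);
similarities = cosineSimilarity(documents);

% Convert to a full matrix and mask the diagonal so that a document
% is not reported as its own nearest neighbor.
S = full(similarities);
S(logical(eye(size(S)))) = -Inf;

% Maximum along each row: best-matching other document per document.
[bestScore,bestMatch] = max(S,[],2);
for i = 1:numel(bestMatch)
    fprintf("Document %d is closest to document %d (score %.3f)\n", ...
        i,bestMatch(i),bestScore(i));
end
```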

Create an array of input documents.

    str = [
        "the quick brown fox jumped over the lazy dog"
        "the fast fox jumped over the lazy dog"
        "the dog sat there and did nothing"
        "the other animals sat there watching"];
    documents = tokenizedDocument(str)

    documents = 
      4x1 tokenizedDocument:

        9 tokens: the quick brown fox jumped over the lazy dog
        8 tokens: the fast fox jumped over the lazy dog
        7 tokens: the dog sat there and did nothing
        6 tokens: the other animals sat there watching

Create an array of query documents.

    str = [
        "a brown fox leaped over the lazy dog"
        "another fox leaped over the dog"];
    queries = tokenizedDocument(str)

    queries = 
      2x1 tokenizedDocument:

        8 tokens: a brown fox leaped over the lazy dog
        6 tokens: another fox leaped over the dog

Calculate the similarities between the input and query documents using the cosineSimilarity function. The output is a sparse matrix.

similarities = cosineSimilarity(documents,queries);

Visualize the similarities of the documents in a heat map.

    figure
    heatmap(similarities);
    xlabel("Query Document")
    ylabel("Input Document")
    title("Cosine Similarities")

Figure contains an object of type heatmap. The chart of type heatmap has title Cosine Similarities.

Scores close to one indicate strong similarity. Scores close to zero indicate weak similarity.
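A common follow-up is to pick the best-matching input document for each query. The following sketch does this with max along the columns. It repeats the setup so that it runs on its own; bestScore and bestDoc are illustrative names.

```matlab
documents = tokenizedDocument([
    "the quick brown fox jumped over the lazy dog"
    "the fast fox jumped over the lazy dog"
    "the dog sat there and did nothing"
    "the other animals sat there watching"]);
queries = tokenizedDocument([
    "a brown fox leaped over the lazy dog"
    "another fox leaped over the dog"]);

similarities = cosineSimilarity(documents,queries);

% Rows index input documents and columns index queries, so take the
% maximum down each column to get one best document per query.
[bestScore,bestDoc] = max(full(similarities),[],1);
for j = 1:numel(bestDoc)
    fprintf("Query %d best matches document %d (score %.3f)\n", ...
        j,bestDoc(j),bestScore(j));
end
```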

Create a bag-of-words model from the text data in sonnets.csv.

    filename = "sonnets.csv";
    tbl = readtable(filename,'TextType','string');
    textData = tbl.Sonnet;
    documents = tokenizedDocument(textData);
    bag = bagOfWords(documents)

    bag = 
      bagOfWords with properties:

              Counts: [154x3527 double]
          Vocabulary: ["From" "fairest" "creatures" "we" ... ]
            NumWords: 3527
        NumDocuments: 154

Calculate similarities between the sonnets using the cosineSimilarity function. The output is a sparse matrix.

similarities = cosineSimilarity(bag);

Visualize the similarities of the first five documents in a heat map.

    figure
    heatmap(similarities(1:5,1:5));
    xlabel("Document")
    ylabel("Document")
    title("Cosine Similarities")

Figure contains an object of type heatmap. The chart of type heatmap has title Cosine Similarities.

Scores close to one indicate strong similarity. Scores close to zero indicate weak similarity.

For bag-of-words input, the cosineSimilarity function calculates the cosine similarity using the tf-idf matrix derived from the model. To compute the cosine similarities on the word count vectors directly, input the word counts to the cosineSimilarity function as a matrix.

Create a bag-of-words model from the text data in sonnets.csv.

    filename = "sonnets.csv";
    tbl = readtable(filename,'TextType','string');
    textData = tbl.Sonnet;
    documents = tokenizedDocument(textData);
    bag = bagOfWords(documents)

    bag = 
      bagOfWords with properties:

              Counts: [154x3527 double]
          Vocabulary: ["From" "fairest" "creatures" "we" ... ]
            NumWords: 3527
        NumDocuments: 154

Get the matrix of word counts from the model.

M = bag.Counts;

Calculate the cosine document similarities of the word count matrix using the cosineSimilarity function. The output is a sparse matrix.

similarities = cosineSimilarity(M);

Visualize the similarities of the first five documents in a heat map.

    figure
    heatmap(similarities(1:5,1:5));
    xlabel("Document")
    ylabel("Document")
    title("Cosine Similarities")

Figure contains an object of type heatmap. The chart of type heatmap has title Cosine Similarities.

Scores close to one indicate strong similarity. Scores close to zero indicate weak similarity.
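To see how the two weightings differ, the sketch below computes both on a small corpus, which stands in for sonnets.csv so that the code runs without the data file. It also checks one assumption: that passing tfidf(bag) as a matrix gives the same scores as passing bag directly. This follows from the description above but has not been verified here, so the sketch prints the difference instead of asserting it.

```matlab
documents = tokenizedDocument([
    "the quick brown fox jumped over the lazy dog"
    "the fast brown fox jumped over the lazy dog"
    "the lazy dog sat there and did nothing"
    "the other animals sat there watching"]);
bag = bagOfWords(documents);

simTfidf  = cosineSimilarity(bag);          % tf-idf weighting
simCounts = cosineSimilarity(bag.Counts);   % raw word counts

% Assumption: an explicit tf-idf matrix gives the same scores as
% passing the bag itself.
simManual = cosineSimilarity(tfidf(bag));
maxDiff = max(abs(full(simTfidf(:)) - full(simManual(:))));
fprintf("Largest difference, bag vs. tfidf(bag): %g\n",maxDiff);

% Compare the two weightings for documents 1 and 3.
fprintf("tf-idf: %.3f, counts: %.3f\n", ...
    full(simTfidf(1,3)),full(simCounts(1,3)));
```

The word "the" appears in every document of this corpus, so it contributes to the count-based scores, while tf-idf weighting discounts it.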

Input Arguments

documents: Input documents, specified as a tokenizedDocument array, a string array of words, or a cell array of character vectors. If documents is not a tokenizedDocument array, then it must be a row vector representing a single document, where each element is a word. To specify multiple documents, use a tokenizedDocument array.

bag: Input bag-of-words or bag-of-n-grams model, specified as a bagOfWords object or a bagOfNgrams object. If bag is a bagOfNgrams object, then the function treats each n-gram as a single word.

queries: Set of query documents, specified as one of the following:

  • A tokenizedDocument array

  • A 1-by-N string array representing a single document, where each element is a word

  • A 1-by-N cell array of character vectors representing a single document, where each element is a word

To compute term frequency and inverse document frequency statistics, the function encodes queries using a bag-of-words model. The model it uses depends on the syntax you call it with. If your syntax specifies the input argument documents, then the function uses bagOfWords(documents). If your syntax specifies bag, then the function encodes queries using bag and then uses the resulting tf-idf matrix.
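The encoding step described above can be approximated by hand with the tfidf function, which accepts a model and a set of documents. The sketch below assumes, without verifying it, that this reproduces what cosineSimilarity(bag,queries) does internally. It prints the largest difference between the two results so that you can check on your own release.

```matlab
documents = tokenizedDocument([
    "the quick brown fox jumped over the lazy dog"
    "the fast fox jumped over the lazy dog"
    "the dog sat there and did nothing"
    "the other animals sat there watching"]);
queries = tokenizedDocument([
    "a brown fox leaped over the lazy dog"
    "another fox leaped over the dog"]);

bag = bagOfWords(documents);

% tf-idf of the input documents, and of the queries encoded with the
% vocabulary and IDF factors of bag.
Mdocs    = tfidf(bag);
Mqueries = tfidf(bag,queries);

simManual = cosineSimilarity(Mdocs,Mqueries);
simDirect = cosineSimilarity(bag,queries);

maxDiff = max(abs(full(simManual(:)) - full(simDirect(:))));
fprintf("Largest difference, manual vs. direct: %g\n",maxDiff);
```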

M: Input data, specified as a matrix. For example, M can be a matrix of word or n-gram counts or a tf-idf matrix.

Data Types: double

Output Arguments

similarities: Cosine similarity scores, returned as a sparse matrix:

  • Given a single array of tokenized documents, similarities is an N-by-N symmetric matrix, where similarities(i,j) represents the similarity between documents(i) and documents(j), and N is the number of input documents.

  • Given an array of tokenized documents and a set of query documents, similarities is an N1-by-N2 matrix, where similarities(i,j) represents the similarity between documents(i) and the jth query document, and N1 and N2 represent the number of documents in documents and queries, respectively.

  • Given a single bag-of-words or bag-of-n-grams model, similarities is a bag.NumDocuments-by-bag.NumDocuments symmetric matrix, where similarities(i,j) represents the similarity between the ith and jth documents encoded by bag.

  • Given a bag-of-words or bag-of-n-grams model and a set of query documents, similarities is a bag.NumDocuments-by-N2 matrix, where similarities(i,j) represents the similarity between the ith document encoded by bag and the jth document in queries, and N2 corresponds to the number of documents in queries.

  • Given a single matrix, similarities is a size(M,1)-by-size(M,1) symmetric matrix, where similarities(i,j) represents the similarity between M(i,:) and M(j,:).

  • Given two matrices, similarities is a size(M1,1)-by-size(M2,1) matrix, where similarities(i,j) represents the similarity between M1(i,:) and M2(j,:).
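The size and symmetry rules for the matrix syntaxes can be checked on small inputs. The following sketch uses made-up matrices M1 and M2 and prints the properties described in the list above.

```matlab
M1 = [1 0 2;
      0 3 1;
      4 0 0];          % 3 rows
M2 = [1 1 0;
      0 2 2];          % 2 rows, same number of columns as M1

S11 = cosineSimilarity(M1);       % expected size(M1,1)-by-size(M1,1)
S12 = cosineSimilarity(M1,M2);    % expected size(M1,1)-by-size(M2,1)

disp(size(S11))
disp(size(S12))
disp(issparse(S12))               % output is described as sparse

% Asymmetry measure for the single-matrix case; this should be zero
% or at rounding level if the result is symmetric as described.
disp(norm(full(S11 - S11'),'fro'))
```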

Introduced in R2020a