All technological notes.
Vector Space Model / term vector model
Assume t distinct terms; each term i in a document or query j is given a weight wij.
dj = (w1j, w2j, …, wtj)
q = (w1q, w2q, …, wtq)
The document term weight wij is based on some variation of the TF (term frequency) or TF-IDF (term frequency-inverse document frequency) scheme.
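A minimal sketch of the representation above, assuming the simplest possible weight (the raw count of each term) and a hypothetical toy document and query. The names to_vector, doc_tokens and query_tokens are illustrative only and not part of these notes.

```python
from collections import Counter

def to_vector(tokens, vocabulary):
    """Return one weight per vocabulary term (assumption: weight = raw count)."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]

# Hypothetical toy data, for illustration only.
doc_tokens = "the cat sat on the mat".split()
query_tokens = "cat mat".split()

# The shared vocabulary fixes t and the position of each term i.
vocabulary = sorted(set(doc_tokens) | set(query_tokens))

d = to_vector(doc_tokens, vocabulary)    # document vector dj
q = to_vector(query_tokens, vocabulary)  # query vector q, same length t
```

Both vectors have the same length t, so they can be compared position by position.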
Term Frequency-Inverse Document Frequency / TF-IDF
The idea is that terms which capture the essence of a document occur frequently in it (their TF is high); but for such a term to be good at discriminating the document from others, it must occur in only a few documents in the general population (its IDF must be high as well).
TF: a term that captures the essence of a document occurs frequently in that document (how often a given term appears in a given document).
IDF: a discriminating term occurs in only a few documents in the general population (the term appears in only a small number of documents).
Term Frequency:
the number of times the term appears in a document, compared to the total number of words in the document.
TFij = number of times term i appears in doc j / total number of terms in doc j
Inverse Document Frequency:
IDFi = log(number of docs / number of docs containing term i)
TF-IDF = TF * IDF
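A short sketch of the TF, IDF and TF-IDF formulas, assuming natural logarithms (the notes do not fix a base) and a hypothetical three-document corpus. Returning 0.0 when no document contains the term is also an assumption, since the formula is undefined in that case.

```python
import math
from collections import Counter

def tf(term, doc_tokens):
    """Occurrences of the term in the doc / total number of terms in the doc."""
    return Counter(doc_tokens)[term] / len(doc_tokens)

def idf(term, corpus):
    """log(number of docs / number of docs containing the term)."""
    n_containing = sum(1 for doc in corpus if term in doc)
    if n_containing == 0:
        return 0.0  # assumption: the formula is undefined here
    return math.log(len(corpus) / n_containing)

def tf_idf(term, doc_tokens, corpus):
    return tf(term, doc_tokens) * idf(term, corpus)

# Hypothetical toy corpus, for illustration only.
corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog".split(),
]

common = tf_idf("the", corpus[0], corpus)  # "the" is in every doc, so IDF is 0
rare = tf_idf("mat", corpus[0], corpus)    # "mat" is in one doc only
```

A term that appears in every document gets weight 0 however often it occurs, while a term confined to one document keeps a positive weight.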

1st:
TFi = numi/sum, numi: number of occurrences of term i; sum: total number of terms.
IDF = log(N/ni), N: number of docs; ni: number of docs containing term i.
2nd:
TFi = 1 + log(numi/sum)
IDF = log(1 + N/ni), N: number of docs; ni: number of docs containing term i.
3rd:
TFi = numi/sum
IDF = log(N/ni)
Cosine measure
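A minimal sketch of the weighting variants listed above, together with the cosine measure. It assumes natural logarithms and the usual definition of cosine as the dot product of two weight vectors divided by the product of their lengths. The function names and the sample vectors are hypothetical.

```python
import math

def weight_v1(num_i, total, n_docs, n_i):
    """1st variant (the 3rd is written with the same formulas):
    (num_i / total) * log(N / n_i)."""
    return (num_i / total) * math.log(n_docs / n_i)

def weight_v2(num_i, total, n_docs, n_i):
    """2nd variant as written in the notes:
    (1 + log(num_i / total)) * log(1 + N / n_i). Requires num_i > 0."""
    return (1 + math.log(num_i / total)) * math.log(1 + n_docs / n_i)

def cosine(u, v):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: treat an all-zero vector as dissimilar
    return dot / (norm_u * norm_v)

# Hypothetical document and query vectors over the same t = 5 terms.
d = [1, 1, 1, 1, 2]
q = [1, 1, 0, 0, 0]
similarity = cosine(d, q)
```

Because the cosine divides by both vector lengths, a long document is not favoured over a short one merely for containing more terms.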
