
User:potok~enwiki

From Wikipedia, the free encyclopedia

tf-icf


The tf–icf weight (term frequency–inverse corpus frequency) is a variant of tf-idf. In tf-idf, the inverse document frequency (idf) of a term is derived from the number of documents in a set that contain the term, usually as the logarithm of the total number of documents divided by that count. For example, if a set contains 100 documents and the term "apple" appears in 37 of them, the idf value is log(100/37), or about 0.99 using the natural logarithm. A drawback of this approach is that whenever a new document is added to the set, the idf value of every term must be recalculated, because the total number of documents changes. This can be computationally expensive for large document sets. The tf-icf approach instead determines the inverse corpus frequency from a large corpus of existing documents, rather than from the document set itself. For example, rather than calculating the inverse frequency of "apple" over 100 documents, the inverse frequency of "apple" is calculated over a corpus of millions of documents.
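The difference in update cost can be illustrated with a minimal Python sketch. The function names, the toy data, and the add-one smoothing inside the logarithm (used here so that terms absent from the corpus do not cause a division by zero) are assumptions made for illustration and are not taken from a particular published implementation.

```python
import math
from collections import Counter


def document_frequencies(documents):
    """Count, for each term, how many documents contain it at least once."""
    counts = Counter()
    for tokens in documents:
        counts.update(set(tokens))
    return counts


def inverse_frequency(term, doc_freq, n_docs):
    """Smoothed log inverse frequency: log((N + 1) / (n_t + 1)) (assumed form)."""
    return math.log((n_docs + 1) / (doc_freq.get(term, 0) + 1))


def weights(tokens, doc_freq, n_docs):
    """Weight each term of one document by its raw count times the inverse frequency."""
    tf = Counter(tokens)
    return {t: tf[t] * inverse_frequency(t, doc_freq, n_docs) for t in tf}


# Hypothetical toy data: a tiny reference corpus stands in for millions of documents.
reference_corpus = [
    ["apple", "pie", "recipe"],
    ["apple", "orchard"],
    ["stock", "market", "report"],
    ["market", "recipe"],
]

# tf-icf: corpus statistics are computed once and reused for every new document.
corpus_df = document_frequencies(reference_corpus)
corpus_n = len(reference_corpus)

new_document = ["apple", "apple", "market"]
tf_icf = weights(new_document, corpus_df, corpus_n)

# tf-idf: statistics come from the document set itself, so they shift on insertion.
document_set = [["apple", "pie"], ["market", "report"]]
idf_before = inverse_frequency(
    "apple", document_frequencies(document_set), len(document_set)
)
document_set.append(new_document)
idf_after = inverse_frequency(
    "apple", document_frequencies(document_set), len(document_set)
)
```

In this sketch `idf_before` and `idf_after` differ, so every stored tf-idf vector would need recomputing after the insertion, whereas `corpus_df` and `corpus_n` are untouched by the new document.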


Mathematical details

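One way to write the two weights side by side is sketched below. The notation is an assumption chosen for illustration: D is the document set being analysed, C is the separate reference corpus, and tf(t, d) is the number of times term t occurs in document d. Published formulations may differ, for example by smoothing the counts or damping the term frequency.

```latex
\begin{aligned}
\mathrm{idf}(t, D) &= \log\frac{|D|}{|\{d \in D : t \in d\}|} \\
\mathrm{icf}(t, C) &= \log\frac{|C|}{|\{c \in C : t \in c\}|} \\
\mathrm{tficf}(t, d, C) &= \mathrm{tf}(t, d)\cdot\mathrm{icf}(t, C)
\end{aligned}
```

With |D| = 100 and 37 documents containing "apple", the first line gives log(100/37), about 0.99 with the natural logarithm. Because icf depends only on C, adding a document to D leaves every icf value unchanged.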

Example


Applications in Vector Space Model


See also


References

  • Spärck Jones, Karen (1972). "A statistical interpretation of term specificity and its application in retrieval" (PDF). Journal of Documentation. 28 (1): 11–21. doi:10.1108/eb026526.
  • Salton, G. and M. J. McGill (1983). Introduction to modern information retrieval. McGraw-Hill. ISBN 0070544840.
  • Salton, Gerard, Edward A. Fox & Harry Wu (1983). "Extended Boolean information retrieval". Communications of the ACM. 26 (11): 1022–1036. doi:10.1145/182.358466.
  • Salton, Gerard and Buckley, C. (1988). "Term-weighting approaches in automatic text retrieval". Information Processing & Management. 24 (5): 513–523. doi:10.1016/0306-4573(88)90021-0.

[[Category:Information retrieval]] [[Category:Artificial intelligence applications]] [[Category:Natural language processing]] [[Category:Ranking functions]]