Calculate the TF-IDF weight using the IDF formula from Spark MLlib's IDF:
termFreq * log((numDocs + 1.0) / (docFreq + 1.0))
Use this weight when working with MLlib vectorized documents.
Note: this is not consistent with the MapReduce seq2sparse implementation of TF-IDF weights, which uses Lucene DefaultSimilarity's TF-IDF calculation:
sqrt(termFreq) * (log(numDocs / (docFreq + 1)) + 1.0)
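The term frequency: the number of times the term occurs in the document
The document frequency: the number of documents that contain the term
The length of the document (unused)
The total number of documents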
The TF-IDF weight as calculated by Spark MLlib's IDF
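The two formulas above can be compared side by side with a minimal sketch. The function names `mllib_tfidf` and `lucene_default_similarity_tfidf` are hypothetical and exist only for this illustration; they are not part of Spark, Lucene, or seq2sparse. The sketch assumes the natural logarithm in both formulas and omits the unused document-length argument.

```python
import math


def mllib_tfidf(term_freq, doc_freq, num_docs):
    """Hypothetical helper: termFreq * log((numDocs + 1.0) / (docFreq + 1.0))."""
    return term_freq * math.log((num_docs + 1.0) / (doc_freq + 1.0))


def lucene_default_similarity_tfidf(term_freq, doc_freq, num_docs):
    """Hypothetical helper: sqrt(termFreq) * (log(numDocs / (docFreq + 1)) + 1.0)."""
    return math.sqrt(term_freq) * (math.log(num_docs / (doc_freq + 1.0)) + 1.0)


if __name__ == "__main__":
    # Illustrative values only: a term seen 4 times in a document,
    # present in 9 of 99 documents.
    tf, df, n = 4, 9, 99
    print("MLlib-style weight: ", mllib_tfidf(tf, df, n))
    print("Lucene-style weight:", lucene_default_similarity_tfidf(tf, df, n))
```

With these inputs the MLlib-style weight is 4 * ln(10), while the Lucene-style weight is 2 * (ln(9.9) + 1), so the same counts produce different weights. Under the MLlib-style formula, a term that appears in every document (docFreq equal to numDocs) gets a weight of zero.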