public final class TFIDFConverter extends Object
This class converts a set of input term-frequency vectors to TF-IDF vectors. It expects a SequenceFile input with a WritableComparable key and a VectorWritable value containing the term frequency vector. This conversion class uses multiple map/reduces to convert the vectors to TF-IDF format.

Modifier and Type | Field and Description
---|---
static String | FEATURE_COUNT
static String | FREQUENCY_FILE
static String | MAX_DF
static String | MIN_DF
static String | VECTOR_COUNT
static String | WORDCOUNT_OUTPUT_FOLDER
Modifier and Type | Method and Description
---|---
static Pair<Long[],List<org.apache.hadoop.fs.Path>> | calculateDF(org.apache.hadoop.fs.Path input, org.apache.hadoop.fs.Path output, org.apache.hadoop.conf.Configuration baseConf, int chunkSizeInMegabytes). Calculates the document frequencies of all terms from the input set of vectors in SequenceFile format.
static void | processTfIdf(org.apache.hadoop.fs.Path input, org.apache.hadoop.fs.Path output, org.apache.hadoop.conf.Configuration baseConf, Pair<Long[],List<org.apache.hadoop.fs.Path>> datasetFeatures, int minDf, long maxDF, float normPower, boolean logNormalize, boolean sequentialAccessOutput, boolean namedVector, int numReducers). Creates Term Frequency-Inverse Document Frequency (TF-IDF) vectors from the input set of vectors in SequenceFile format.
public static final String VECTOR_COUNT
public static final String FEATURE_COUNT
public static final String MIN_DF
public static final String MAX_DF
public static final String FREQUENCY_FILE
public static final String WORDCOUNT_OUTPUT_FOLDER
public static void processTfIdf(org.apache.hadoop.fs.Path input, org.apache.hadoop.fs.Path output, org.apache.hadoop.conf.Configuration baseConf, Pair<Long[],List<org.apache.hadoop.fs.Path>> datasetFeatures, int minDf, long maxDF, float normPower, boolean logNormalize, boolean sequentialAccessOutput, boolean namedVector, int numReducers) throws IOException, InterruptedException, ClassNotFoundException
Creates Term Frequency-Inverse Document Frequency (TF-IDF) vectors from the input set of vectors in SequenceFile format. This job uses a fixed limit on the maximum memory used by the feature chunk per node, thereby splitting the process across multiple map/reduces. Before using this method, calculateDF should be called.

Parameters:
input - input directory of the vectors in SequenceFile format
output - output directory where the RandomAccessSparseVectors of the documents are generated
datasetFeatures - document frequency information calculated by calculateDF
minDf - the minimum document frequency. Default 1
maxDF - the maximum document frequency, expressed as a percentage of the vector count. Can be used to remove very high frequency features. Expressed as an integer between 0 and 100. Default 99
numReducers - the number of reducers to spawn. This also affects the possible parallelism, since each reducer will typically produce a single output file containing the TF-IDF vectors for a subset of the documents in the corpus.

Throws:
IOException
InterruptedException
ClassNotFoundException
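For intuition about what processTfIdf computes per feature, the following is a minimal plain-Java sketch of the classic TF-IDF weight and of pruning by the minDf/maxDF thresholds. It is an illustration only: Mahout's actual job delegates the weighting to a Lucene-style similarity, so the exact smoothing may differ, and the class name here is hypothetical.

```java
// Hypothetical illustration class; not part of Mahout's API.
public class TfIdfSketch {
    // Classic TF-IDF weight: tf * log(numDocs / df). Mahout's job uses a
    // Lucene-style similarity, so its exact smoothing may differ.
    static double tfIdf(int tf, int df, int numDocs) {
        return tf * Math.log((double) numDocs / df);
    }

    // Pruning analogous to minDf / maxDF: drop features whose document
    // frequency is below minDf or above maxDF percent of the document count.
    static boolean keepFeature(int df, int numDocs, int minDf, int maxDfPercent) {
        return df >= minDf && df * 100L <= (long) maxDfPercent * numDocs;
    }

    public static void main(String[] args) {
        int numDocs = 100;
        System.out.println(tfIdf(3, 10, numDocs));            // mid-frequency term gets a positive weight
        System.out.println(keepFeature(1, numDocs, 2, 99));   // pruned: below minDf
        System.out.println(keepFeature(100, numDocs, 2, 99)); // pruned: above maxDF percent
        System.out.println(keepFeature(50, numDocs, 2, 99));  // kept
    }
}
```

A feature appearing in every document (df = numDocs) gets weight tf * log(1) = 0, which is why near-ubiquitous features are both useless and worth pruning with maxDF.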
public static Pair<Long[],List<org.apache.hadoop.fs.Path>> calculateDF(org.apache.hadoop.fs.Path input, org.apache.hadoop.fs.Path output, org.apache.hadoop.conf.Configuration baseConf, int chunkSizeInMegabytes) throws IOException, InterruptedException, ClassNotFoundException
Calculates the document frequencies of all terms from the input set of vectors in SequenceFile format. This job uses a fixed limit on the maximum memory used by the feature chunk per node, thereby splitting the process across multiple map/reduces.

Parameters:
input - input directory of the vectors in SequenceFile format
output - output directory where document frequencies will be stored
chunkSizeInMegabytes - the size in MB of the feature-to-id chunk to be kept in memory at each node during the map/reduce stage. It is recommended that you calculate this based on the number of cores and the free memory available per node. For example, if you have 2 cores and around 1 GB of memory to spare, a chunk size of around 400-500 MB lets two simultaneous reducers create partial vectors without thrashing the system through increased swapping.

Throws:
IOException
InterruptedException
ClassNotFoundException
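Conceptually, the document frequency of a term is the number of documents whose term-frequency vector has a nonzero entry for it. The following is a minimal in-memory sketch of that counting step, using plain Java maps in place of VectorWritable and the distributed chunking that calculateDF performs; the class name is hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; not part of Mahout's API.
public class DocFrequencySketch {
    // For each term, counts the number of documents whose term-frequency
    // vector contains a nonzero entry for that term. This is the in-memory
    // analogue of what calculateDF distributes across map/reduces.
    static Map<String, Integer> documentFrequencies(List<Map<String, Integer>> tfVectors) {
        Map<String, Integer> df = new HashMap<>();
        for (Map<String, Integer> doc : tfVectors) {
            for (Map.Entry<String, Integer> e : doc.entrySet()) {
                if (e.getValue() > 0) {
                    df.merge(e.getKey(), 1, Integer::sum);
                }
            }
        }
        return df;
    }

    public static void main(String[] args) {
        List<Map<String, Integer>> docs = List.of(
                Map.of("hadoop", 2, "vector", 1),
                Map.of("hadoop", 1),
                Map.of("vector", 3, "tfidf", 1));
        System.out.println(documentFrequencies(docs)); // hadoop and vector in 2 docs, tfidf in 1
    }
}
```

In the real job the feature-to-id dictionary is split into chunks of chunkSizeInMegabytes so that no single node has to hold the entire vocabulary in memory.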
Copyright © 2008–2017 The Apache Software Foundation. All rights reserved.