public final class DictionaryVectorizer extends AbstractJob implements Vectorizer

This class converts a set of input documents into term-frequency vectors. The input SequenceFile should have a `Text` key containing the unique document identifier and a `StringTuple` value containing the tokenized document. You may use `DocumentProcessor` to tokenize the document.
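The expected input shape (a unique document id mapped to its token list) can be sketched with plain Java collections. This is a stand-in only: the real `DocumentProcessor` runs a Lucene Analyzer over each document and writes a `SequenceFile<Text, StringTuple>`; the whitespace tokenizer here is an assumption used purely to show the key/value shape.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TokenizedInputDemo {
    /** Naive whitespace tokenizer standing in for DocumentProcessor,
     *  which would normally apply a Lucene Analyzer. */
    static List<String> tokenize(String text) {
        return Arrays.asList(text.split("\\s+"));
    }

    public static void main(String[] args) {
        // Models a SequenceFile<Text, StringTuple>: doc id -> tokenized document
        Map<String, List<String>> corpus = new LinkedHashMap<>();
        corpus.put("doc1", tokenize("the quick brown fox"));
        corpus.put("doc2", tokenize("the lazy dog"));
        System.out.println(corpus.get("doc1")); // [the, quick, brown, fox]
    }
}
```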
This is a dictionary-based Vectorizer.

| Modifier and Type | Field and Description |
|---|---|
| `static int` | `DEFAULT_MIN_SUPPORT` |
| `static String` | `DICTIONARY_FILE` |
| `static String` | `DOCUMENT_VECTOR_OUTPUT_FOLDER` |
| `static String` | `MAX_NGRAMS` |
| `static String` | `MIN_SUPPORT` |
Fields inherited from class AbstractJob: `argMap, inputFile, inputPath, outputFile, outputPath, tempPath`
| Modifier and Type | Method and Description |
|---|---|
| `static void` | `createTermFrequencyVectors(org.apache.hadoop.fs.Path input, org.apache.hadoop.fs.Path output, String tfVectorsFolderName, org.apache.hadoop.conf.Configuration baseConf, int minSupport, int maxNGramSize, float minLLRValue, float normPower, boolean logNormalize, int numReducers, int chunkSizeInMegabytes, boolean sequentialAccess, boolean namedVectors)` Create Term Frequency (Tf) vectors from the input set of documents in SequenceFile format. |
| `void` | `createVectors(org.apache.hadoop.fs.Path input, org.apache.hadoop.fs.Path output, VectorizerConfig config)` |
| `static void` | `main(String[] args)` |
| `int` | `run(String[] args)` |
Methods inherited from class AbstractJob: `addFlag, addInputOption, addOption, addOption, addOption, addOption, addOutputOption, buildOption, buildOption, getAnalyzerClassFromOption, getCLIOption, getConf, getDimensions, getFloat, getFloat, getGroup, getInputFile, getInputPath, getInt, getInt, getOption, getOption, getOption, getOptions, getOutputFile, getOutputPath, getOutputPath, getTempPath, getTempPath, hasOption, keyFor, maybePut, parseArguments, parseArguments, parseDirectories, prepareJob, prepareJob, prepareJob, prepareJob, setConf, setS3SafeCombinedInputPath, shouldRunNextPhase`
Field detail:

- `public static final String DOCUMENT_VECTOR_OUTPUT_FOLDER`
- `public static final String MIN_SUPPORT`
- `public static final String MAX_NGRAMS`
- `public static final int DEFAULT_MIN_SUPPORT`
- `public static final String DICTIONARY_FILE`
`public void createVectors(org.apache.hadoop.fs.Path input, org.apache.hadoop.fs.Path output, VectorizerConfig config) throws IOException, ClassNotFoundException, InterruptedException`

Specified by: `createVectors` in interface `Vectorizer`

Throws: `IOException`, `ClassNotFoundException`, `InterruptedException`
`public static void createTermFrequencyVectors(org.apache.hadoop.fs.Path input, org.apache.hadoop.fs.Path output, String tfVectorsFolderName, org.apache.hadoop.conf.Configuration baseConf, int minSupport, int maxNGramSize, float minLLRValue, float normPower, boolean logNormalize, int numReducers, int chunkSizeInMegabytes, boolean sequentialAccess, boolean namedVectors) throws IOException, InterruptedException, ClassNotFoundException`

Create Term Frequency (Tf) vectors from the input set of documents in SequenceFile format. This method tries to bound the maximum memory used by the feature chunk per node, splitting the process across multiple map/reduce jobs.

Parameters:

- `input` - input directory of the documents in SequenceFile format
- `output` - output directory where the `RandomAccessSparseVector`s of the documents are generated
- `tfVectorsFolderName` - the name of the folder in which the final output vectors will be stored
- `baseConf` - job configuration
- `minSupport` - the minimum frequency a feature must have in the entire corpus to be considered for inclusion in the sparse vector
- `maxNGramSize` - 1 = unigrams; 2 = unigrams and bigrams; 3 = unigrams, bigrams, and trigrams
- `minLLRValue` - minimum log-likelihood ratio value used to prune n-grams
- `normPower` - the L_p norm to be computed
- `logNormalize` - whether to use log normalization
- `numReducers` -
- `chunkSizeInMegabytes` - the size in MB of the feature => id chunk to be kept in memory at each node during the Map/Reduce stage. It is recommended you calculate this based on the number of cores and the free memory available per node. Say you have 2 cores and around 1 GB of memory to spare; we recommend a split size of around 400-500 MB, so that two simultaneous reducers can create partial vectors without thrashing the system through increased swapping
- `sequentialAccess` -
- `namedVectors` -

Throws: `IOException`, `InterruptedException`, `ClassNotFoundException`
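The sizing guidance for `chunkSizeInMegabytes` above amounts to dividing the spare memory on a node across the reducers that run simultaneously (roughly one per core). A minimal sketch of that arithmetic follows; the helper name and the 10% headroom factor are illustrative assumptions, not part of the Mahout API.

```java
public class ChunkSizeEstimate {
    /** Splits free memory per node across simultaneous reducers (one per core),
     *  keeping ~10% headroom so partial-vector creation does not push the node
     *  into swap. The headroom factor is an assumption, not a Mahout default. */
    static int recommendedChunkSizeMb(int freeMemoryMb, int cores) {
        return (int) (freeMemoryMb / cores * 0.9);
    }

    public static void main(String[] args) {
        // 2 cores and ~1 GB to spare lands in the 400-500 MB range the docs suggest
        System.out.println(recommendedChunkSizeMb(1024, 2)); // 460
    }
}
```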
`public int run(String[] args) throws Exception`

Specified by: `run` in interface `org.apache.hadoop.util.Tool`

Throws: `Exception`
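The `maxNGramSize` semantics documented for `createTermFrequencyVectors` (1 = unigrams only, 2 = adds bigrams, 3 = adds trigrams) can be illustrated with a self-contained enumeration sketch. Note this only enumerates candidate n-grams; in the actual job, n-grams beyond unigrams are additionally pruned by the `minLLRValue` log-likelihood ratio threshold.

```java
import java.util.ArrayList;
import java.util.List;

public class NGramDemo {
    /** Enumerates all n-grams from size 1 up to maxNGramSize, mirroring the
     *  documented semantics: 1 = unigrams; 2 = + bigrams; 3 = + trigrams. */
    static List<String> ngrams(List<String> tokens, int maxNGramSize) {
        List<String> out = new ArrayList<>();
        for (int n = 1; n <= maxNGramSize; n++) {
            for (int i = 0; i + n <= tokens.size(); i++) {
                out.add(String.join(" ", tokens.subList(i, i + n)));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(ngrams(List.of("the", "quick", "fox"), 2));
        // [the, quick, fox, the quick, quick fox]
    }
}
```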
Copyright © 2008–2017 The Apache Software Foundation. All rights reserved.