CVB0Driver (Mahout Map-Reduce 0.13.0 API)

java.lang.Object
- org.apache.hadoop.conf.Configured
  - org.apache.mahout.common.AbstractJob
    - org.apache.mahout.clustering.lda.cvb.CVB0Driver

All Implemented Interfaces:

org.apache.hadoop.conf.Configurable, org.apache.hadoop.util.Tool
```
public class CVB0Driver
extends AbstractJob
```
See CachingCVB0Mapper for more details on scalability and room for improvement. To try out this LDA implementation without using Hadoop, see InMemoryCollapsedVariationalBayes0. To train directly in Java code with your own main(), see ModelTrainer and TopicModel.

Usage: ./bin/mahout cvb <options>

Valid options include:

--input path

Input path for SequenceFile<IntWritable, VectorWritable> document vectors. See SparseVectorsFromSequenceFiles for details on how to generate this input format.

--dictionary path

Path to dictionary file(s) generated during construction of input document vectors (glob expression supported). If set, this data is scanned to determine an appropriate value for option --num_terms.

--output path

Output path for topic-term distributions.

--doc_topic_output path

Output path for doc-topic distributions.

--num_topics k

Number of latent topics.

--num_terms nt

Number of unique features defined by input document vectors. If option --dictionary is defined and this option is unspecified, term count is calculated from dictionary.

--topic_model_temp_dir path

Path in which to store model state after each iteration.

--maxIter i

Maximum number of iterations to perform. If this value is less than or equal to the number of iteration states found beneath the path specified by option --topic_model_temp_dir, no further iterations are performed. Instead, output topic-term and doc-topic distributions are generated using data from the specified iteration.

--max_doc_topic_iters i

Maximum number of iterations per doc for p(topic|doc) learning. Defaults to 10.

--doc_topic_smoothing a

Smoothing for doc-topic distribution. Defaults to 0.0001.

--term_topic_smoothing e

Smoothing for topic-term distribution. Defaults to 0.0001.

--random_seed seed

Integer seed for random number generation.

--test_set_percentage p

Fraction of data to hold out for testing. Defaults to 0.0.

--iteration_block_size block

Number of iterations between perplexity checks. Defaults to 10. This option is ignored unless option --test_set_percentage is greater than zero.
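
The options above can be assembled programmatically before being handed to the command line or to main(String[] args). The following is a minimal sketch in plain Java: the class CvbArgsSketch and its toArgs method are hypothetical helpers, not part of Mahout, and every path and value is a placeholder. Only the option names are taken from the list above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper (not part of Mahout): assembles an argument list for the cvb job. */
final class CvbArgsSketch {

  /** Turns option-name/value pairs into a flat "--name value" argument list. */
  static List<String> toArgs(Map<String, String> options) {
    List<String> args = new ArrayList<>();
    for (Map.Entry<String, String> e : options.entrySet()) {
      args.add("--" + e.getKey());
      args.add(e.getValue());
    }
    return args;
  }

  public static void main(String[] unused) {
    // Option names come from the list above; all paths and values are placeholders.
    Map<String, String> options = new LinkedHashMap<>();
    options.put("input", "/tmp/docs/vectors");
    options.put("dictionary", "/tmp/docs/dictionary.file-*");
    options.put("output", "/tmp/lda/topic-term");
    options.put("doc_topic_output", "/tmp/lda/doc-topic");
    options.put("topic_model_temp_dir", "/tmp/lda/state");
    options.put("num_topics", "20");
    options.put("maxIter", "30");
    System.out.println("./bin/mahout cvb " + String.join(" ", toArgs(options)));
  }
}
```

A LinkedHashMap keeps the options in insertion order, which makes the printed command line easier to compare against the option list.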

Nested Class Summary

Nested Classes
Modifier and Type	Class and Description
`static class`	`CVB0Driver.DualDoubleSumReducer` Sums keys and values independently.
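
To illustrate what "sums keys and values independently" means, here is a minimal plain-Java sketch. It is not the Hadoop reducer itself: the class DualSumSketch and the representation of each key/value pair as a two-element double array are assumptions made for this example.

```java
import java.util.Arrays;
import java.util.List;

/** Plain-Java illustration; the real class is a Hadoop reducer and is not shown here. */
final class DualSumSketch {

  /** Returns {sum of keys, sum of values} for a list of {key, value} pairs. */
  static double[] sumIndependently(List<double[]> pairs) {
    double keySum = 0.0;
    double valueSum = 0.0;
    for (double[] pair : pairs) {
      keySum += pair[0];   // keys are accumulated on their own
      valueSum += pair[1]; // values are accumulated separately
    }
    return new double[] {keySum, valueSum};
  }

  public static void main(String[] args) {
    List<double[]> pairs = Arrays.asList(
        new double[] {1.0, 10.0},
        new double[] {2.0, 20.0},
        new double[] {3.0, 30.0});
    double[] sums = sumIndependently(pairs);
    System.out.println(sums[0] + " " + sums[1]); // prints "6.0 60.0"
  }
}
```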

Field Summary

Fields
Modifier and Type	Field and Description
`static String`	`BACKFILL_PERPLEXITY`
`static String`	`DICTIONARY`
`static String`	`DOC_TOPIC_OUTPUT`
`static String`	`DOC_TOPIC_SMOOTHING`
`static String`	`ITERATION_BLOCK_SIZE`
`static String`	`MAX_ITERATIONS_PER_DOC`
`static String`	`MODEL_TEMP_DIR`
`static String`	`MODEL_WEIGHT`
`static String`	`NUM_REDUCE_TASKS`
`static String`	`NUM_TERMS`
`static String`	`NUM_TOPICS`
`static String`	`NUM_TRAIN_THREADS`
`static String`	`NUM_UPDATE_THREADS`
`static String`	`RANDOM_SEED`
`static String`	`TERM_TOPIC_SMOOTHING`
`static String`	`TEST_SET_FRACTION`

Fields inherited from class org.apache.mahout.common.AbstractJob
argMap, inputFile, inputPath, outputFile, outputPath, tempPath

Constructor Summary

Constructors
Constructor and Description
`CVB0Driver()`

Method Summary

Modifier and Type	Method and Description
`static org.apache.hadoop.fs.Path[]`	`getModelPaths(org.apache.hadoop.conf.Configuration conf)`
`static void`	`main(String[] args)`
`static org.apache.hadoop.fs.Path`	`modelPath(org.apache.hadoop.fs.Path topicModelStateTempPath, int iterationNumber)`
`static org.apache.hadoop.fs.Path`	`perplexityPath(org.apache.hadoop.fs.Path topicModelStateTempPath, int iterationNumber)`
`static double`	`readPerplexity(org.apache.hadoop.conf.Configuration conf, org.apache.hadoop.fs.Path topicModelStateTemp, int iteration)`
`int`	`run(org.apache.hadoop.conf.Configuration conf, org.apache.hadoop.fs.Path inputPath, org.apache.hadoop.fs.Path topicModelOutputPath, int numTopics, int numTerms, double alpha, double eta, int maxIterations, int iterationBlockSize, double convergenceDelta, org.apache.hadoop.fs.Path dictionaryPath, org.apache.hadoop.fs.Path docTopicOutputPath, org.apache.hadoop.fs.Path topicModelStateTempPath, long randomSeed, float testFraction, int numTrainThreads, int numUpdateThreads, int maxItersPerDoc, int numReduceTasks, boolean backfillPerplexity)`
`int`	`run(String[] args)`
`void`	`runIteration(org.apache.hadoop.conf.Configuration conf, org.apache.hadoop.fs.Path corpusInput, org.apache.hadoop.fs.Path modelInput, org.apache.hadoop.fs.Path modelOutput, int iterationNumber, int maxIterations, int numReduceTasks)`

Methods inherited from class org.apache.mahout.common.AbstractJob
addFlag, addInputOption, addOption, addOption, addOption, addOption, addOutputOption, buildOption, buildOption, getAnalyzerClassFromOption, getCLIOption, getConf, getDimensions, getFloat, getFloat, getGroup, getInputFile, getInputPath, getInt, getInt, getOption, getOption, getOption, getOptions, getOutputFile, getOutputPath, getOutputPath, getTempPath, getTempPath, hasOption, keyFor, maybePut, parseArguments, parseArguments, parseDirectories, prepareJob, prepareJob, prepareJob, prepareJob, setConf, setS3SafeCombinedInputPath, shouldRunNextPhase

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Detail

NUM_TOPICS
```
public static final String NUM_TOPICS
```
See Also:

Constant Field Values

NUM_TERMS
```
public static final String NUM_TERMS
```
See Also:

Constant Field Values

DOC_TOPIC_SMOOTHING

public static final String DOC_TOPIC_SMOOTHING

See Also: Constant Field Values

TERM_TOPIC_SMOOTHING

public static final String TERM_TOPIC_SMOOTHING

See Also: Constant Field Values

DICTIONARY
```
public static final String DICTIONARY
```
See Also:

Constant Field Values

DOC_TOPIC_OUTPUT

public static final String DOC_TOPIC_OUTPUT

See Also: Constant Field Values

MODEL_TEMP_DIR

public static final String MODEL_TEMP_DIR

See Also: Constant Field Values

ITERATION_BLOCK_SIZE

public static final String ITERATION_BLOCK_SIZE

See Also: Constant Field Values

RANDOM_SEED
```
public static final String RANDOM_SEED
```
See Also:

Constant Field Values

TEST_SET_FRACTION

public static final String TEST_SET_FRACTION

See Also: Constant Field Values

NUM_TRAIN_THREADS

public static final String NUM_TRAIN_THREADS

See Also: Constant Field Values

NUM_UPDATE_THREADS

public static final String NUM_UPDATE_THREADS

See Also: Constant Field Values

MAX_ITERATIONS_PER_DOC

public static final String MAX_ITERATIONS_PER_DOC

See Also: Constant Field Values

MODEL_WEIGHT

public static final String MODEL_WEIGHT

See Also: Constant Field Values

NUM_REDUCE_TASKS

public static final String NUM_REDUCE_TASKS

See Also: Constant Field Values

BACKFILL_PERPLEXITY

public static final String BACKFILL_PERPLEXITY

See Also: Constant Field Values

Constructor Detail
CVB0Driver
```
public CVB0Driver()
```

Method Detail

run

public int run(String[] args)
        throws Exception

Throws: Exception

run

public int run(org.apache.hadoop.conf.Configuration conf,
               org.apache.hadoop.fs.Path inputPath,
               org.apache.hadoop.fs.Path topicModelOutputPath,
               int numTopics,
               int numTerms,
               double alpha,
               double eta,
               int maxIterations,
               int iterationBlockSize,
               double convergenceDelta,
               org.apache.hadoop.fs.Path dictionaryPath,
               org.apache.hadoop.fs.Path docTopicOutputPath,
               org.apache.hadoop.fs.Path topicModelStateTempPath,
               long randomSeed,
               float testFraction,
               int numTrainThreads,
               int numUpdateThreads,
               int maxItersPerDoc,
               int numReduceTasks,
               boolean backfillPerplexity)
        throws ClassNotFoundException,
               IOException,
               InterruptedException

Throws: ClassNotFoundException, IOException, InterruptedException
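
The parameters maxIterations, iterationBlockSize, and testFraction correspond to the options --maxIter, --iteration_block_size, and --test_set_percentage described earlier. The sketch below models only the rules stated in those option descriptions; it is not the driver's actual control flow. The class IterationPlanSketch is hypothetical, and the placement of perplexity checks on every multiple of the block size is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical planning helper; models only the rules stated in the option descriptions. */
final class IterationPlanSketch {

  /** Iterations still to run, given how many iteration states already exist. */
  static int remainingIterations(int maxIterations, int existingStates) {
    // If maxIterations <= existingStates, no further iterations are performed.
    return Math.max(0, maxIterations - existingStates);
  }

  /**
   * Iteration numbers after which a perplexity check would happen, assuming a check
   * falls on every multiple of iterationBlockSize (an assumption, not confirmed).
   */
  static List<Integer> perplexityCheckpoints(int existingStates, int maxIterations,
                                             int iterationBlockSize, float testFraction) {
    List<Integer> checkpoints = new ArrayList<>();
    if (testFraction <= 0.0f || iterationBlockSize <= 0) {
      return checkpoints; // the block size is ignored without a held-out test set
    }
    for (int i = existingStates + 1; i <= maxIterations; i++) {
      if (i % iterationBlockSize == 0) {
        checkpoints.add(i);
      }
    }
    return checkpoints;
  }

  public static void main(String[] args) {
    System.out.println(remainingIterations(30, 12));             // prints 18
    System.out.println(perplexityCheckpoints(12, 30, 10, 0.1f)); // prints [20, 30]
    System.out.println(perplexityCheckpoints(12, 30, 10, 0.0f)); // prints []
  }
}
```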

readPerplexity
```
public static double readPerplexity(org.apache.hadoop.conf.Configuration conf,
                                    org.apache.hadoop.fs.Path topicModelStateTemp,
                                    int iteration)
                             throws IOException
```
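Parameters:

topicModelStateTemp - path under which model state is stored after each iteration.

iteration - number of the iteration whose perplexity data is read.

Returns:

Perplexity for the given iteration, derived from the perplexity and model weight of those documents sampled during perplexity computation. Note that the declared return type is a primitive double, not double[2], so null cannot be returned when no perplexity data exists for the iteration.

Throws:

IOException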

modelPath

public static org.apache.hadoop.fs.Path modelPath(org.apache.hadoop.fs.Path topicModelStateTempPath,
                                                  int iterationNumber)

perplexityPath

public static org.apache.hadoop.fs.Path perplexityPath(org.apache.hadoop.fs.Path topicModelStateTempPath,
                                                       int iterationNumber)
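
Both modelPath and perplexityPath map a state directory and an iteration number to a per-iteration location. The following sketch shows one way such a mapping could look. It uses java.nio.file.Path as a stand-in for org.apache.hadoop.fs.Path so that it runs without Hadoop, and the "model-" and "perplexity-" name prefixes are assumptions made for illustration, not documented behaviour.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

/** Illustration using java.nio.file.Path as a stand-in for org.apache.hadoop.fs.Path. */
final class StatePathSketch {

  // The "model-" and "perplexity-" prefixes are assumptions made for this sketch.
  static Path modelPath(Path stateDir, int iterationNumber) {
    return stateDir.resolve("model-" + iterationNumber);
  }

  static Path perplexityPath(Path stateDir, int iterationNumber) {
    return stateDir.resolve("perplexity-" + iterationNumber);
  }

  public static void main(String[] args) {
    Path stateDir = Paths.get("lda", "state"); // placeholder directory
    System.out.println(modelPath(stateDir, 3).getFileName());      // prints "model-3"
    System.out.println(perplexityPath(stateDir, 3).getFileName()); // prints "perplexity-3"
  }
}
```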

runIteration

public void runIteration(org.apache.hadoop.conf.Configuration conf,
                         org.apache.hadoop.fs.Path corpusInput,
                         org.apache.hadoop.fs.Path modelInput,
                         org.apache.hadoop.fs.Path modelOutput,
                         int iterationNumber,
                         int maxIterations,
                         int numReduceTasks)
                  throws IOException,
                         ClassNotFoundException,
                         InterruptedException

Throws: IOException, ClassNotFoundException, InterruptedException

getModelPaths

public static org.apache.hadoop.fs.Path[] getModelPaths(org.apache.hadoop.conf.Configuration conf)

main

public static void main(String[] args)
                 throws Exception

Throws: Exception
