public class CVB0Driver extends AbstractJob
See CachingCVB0Mapper for more details on scalability and room for improvement.
To try out this LDA implementation without using Hadoop, check out InMemoryCollapsedVariationalBayes0. If you want to do training directly in Java code with your own main(), then look to ModelTrainer and TopicModel.
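Usage: ./bin/mahout cvb options

Valid options include:

--input path
Input path for SequenceFile<IntWritable, VectorWritable> document vectors. See SparseVectorsFromSequenceFiles for details on how to generate this input format.

--dictionary path
Path to the dictionary for the input document vectors. If set, it is used to determine the value of option --num_terms.

--output path
Output path for topic-term distributions.

--doc_topic_output path
Output path for doc-topic distributions.

--num_topics k
Number of topics.

--num_terms nt
Number of terms in the input document vectors. If option --dictionary is defined and this option is unspecified, term count is calculated from dictionary.

--topic_model_temp_dir path
Path in which model state is stored after each iteration.

--maxIter i
Maximum number of iterations to perform. If state for this iteration already exists under the path given by option --topic_model_temp_dir, no further iterations are performed. Instead, output topic-term and doc-topic distributions are generated using data from the specified iteration.

--max_doc_topic_iters i
Maximum number of iterations per document. Defaults to 10.

--doc_topic_smoothing a
Smoothing for the doc-topic distribution. Defaults to 0.0001.

--term_topic_smoothing e
Smoothing for the topic-term distribution. Defaults to 0.0001.

--random_seed seed
Seed for random number generation.

--test_set_percentage p
Fraction of the data to hold out for testing. Defaults to 0.0.

--iteration_block_size block
Number of iterations between perplexity checks. Defaults to 10. This option is
ignored unless option --test_set_percentage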
is greater than zero.

Modifier and Type | Class and Description |
---|---|
static class | CVB0Driver.DualDoubleSumReducer: Sums keys and values independently. |
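
The command-line options listed in the usage section can also be assembled programmatically. The following is a minimal sketch, not part of Mahout: the class `CvbArgs`, its `build` method, and the `/tmp/...` paths are hypothetical names chosen for illustration. Only the option names are taken from the usage text, and this page does not state which options are mandatory.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper (not part of Mahout) that assembles arguments for "mahout cvb". */
public class CvbArgs {

  /**
   * Builds an argument array from option names that appear in the usage text.
   * Entries in extras are appended after the three base options; an extra with
   * the same name as a base option replaces its value.
   */
  public static String[] build(String input, String output, int numTopics,
                               Map<String, String> extras) {
    // LinkedHashMap keeps the options in insertion order.
    Map<String, String> options = new LinkedHashMap<>();
    options.put("--input", input);
    options.put("--output", output);
    options.put("--num_topics", Integer.toString(numTopics));
    options.putAll(extras);

    // Flatten into the "name value name value ..." form a command line expects.
    List<String> args = new ArrayList<>();
    for (Map.Entry<String, String> e : options.entrySet()) {
      args.add(e.getKey());
      args.add(e.getValue());
    }
    return args.toArray(new String[0]);
  }

  public static void main(String[] unused) {
    Map<String, String> extras = new LinkedHashMap<>();
    extras.put("--maxIter", "20");
    extras.put("--test_set_percentage", "0.1"); // hold out data so perplexity is checked
    extras.put("--iteration_block_size", "5");  // ignored when the test fraction is zero
    String[] args = build("/tmp/vectors", "/tmp/topics", 10, extras); // hypothetical paths
    System.out.println("./bin/mahout cvb " + String.join(" ", args));
  }
}
```

An array built this way has the shape accepted by the `main(String[] args)` and `run(String[] args)` methods listed in the method summary. Whether a given combination of options is sufficient for a real run is not something this sketch checks.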

Modifier and Type | Field and Description |
---|---|
static String | BACKFILL_PERPLEXITY |
static String | DICTIONARY |
static String | DOC_TOPIC_OUTPUT |
static String | DOC_TOPIC_SMOOTHING |
static String | ITERATION_BLOCK_SIZE |
static String | MAX_ITERATIONS_PER_DOC |
static String | MODEL_TEMP_DIR |
static String | MODEL_WEIGHT |
static String | NUM_REDUCE_TASKS |
static String | NUM_TERMS |
static String | NUM_TOPICS |
static String | NUM_TRAIN_THREADS |
static String | NUM_UPDATE_THREADS |
static String | RANDOM_SEED |
static String | TERM_TOPIC_SMOOTHING |
static String | TEST_SET_FRACTION |

Fields inherited from class AbstractJob: argMap, inputFile, inputPath, outputFile, outputPath, tempPath

Constructor and Description |
---|
CVB0Driver() |

Modifier and Type | Method and Description |
---|---|
static org.apache.hadoop.fs.Path[] | getModelPaths(org.apache.hadoop.conf.Configuration conf) |
static void | main(String[] args) |
static org.apache.hadoop.fs.Path | modelPath(org.apache.hadoop.fs.Path topicModelStateTempPath, int iterationNumber) |
static org.apache.hadoop.fs.Path | perplexityPath(org.apache.hadoop.fs.Path topicModelStateTempPath, int iterationNumber) |
static double | readPerplexity(org.apache.hadoop.conf.Configuration conf, org.apache.hadoop.fs.Path topicModelStateTemp, int iteration) |
int | run(org.apache.hadoop.conf.Configuration conf, org.apache.hadoop.fs.Path inputPath, org.apache.hadoop.fs.Path topicModelOutputPath, int numTopics, int numTerms, double alpha, double eta, int maxIterations, int iterationBlockSize, double convergenceDelta, org.apache.hadoop.fs.Path dictionaryPath, org.apache.hadoop.fs.Path docTopicOutputPath, org.apache.hadoop.fs.Path topicModelStateTempPath, long randomSeed, float testFraction, int numTrainThreads, int numUpdateThreads, int maxItersPerDoc, int numReduceTasks, boolean backfillPerplexity) |
int | run(String[] args) |
void | runIteration(org.apache.hadoop.conf.Configuration conf, org.apache.hadoop.fs.Path corpusInput, org.apache.hadoop.fs.Path modelInput, org.apache.hadoop.fs.Path modelOutput, int iterationNumber, int maxIterations, int numReduceTasks) |

Methods inherited from class AbstractJob: addFlag, addInputOption, addOption, addOption, addOption, addOption, addOutputOption, buildOption, buildOption, getAnalyzerClassFromOption, getCLIOption, getConf, getDimensions, getFloat, getFloat, getGroup, getInputFile, getInputPath, getInt, getInt, getOption, getOption, getOption, getOptions, getOutputFile, getOutputPath, getOutputPath, getTempPath, getTempPath, hasOption, keyFor, maybePut, parseArguments, parseArguments, parseDirectories, prepareJob, prepareJob, prepareJob, prepareJob, setConf, setS3SafeCombinedInputPath, shouldRunNextPhase

public static final String NUM_TOPICS
public static final String NUM_TERMS
public static final String DOC_TOPIC_SMOOTHING
public static final String TERM_TOPIC_SMOOTHING
public static final String DICTIONARY
public static final String DOC_TOPIC_OUTPUT
public static final String MODEL_TEMP_DIR
public static final String ITERATION_BLOCK_SIZE
public static final String RANDOM_SEED
public static final String TEST_SET_FRACTION
public static final String NUM_TRAIN_THREADS
public static final String NUM_UPDATE_THREADS
public static final String MAX_ITERATIONS_PER_DOC
public static final String MODEL_WEIGHT
public static final String NUM_REDUCE_TASKS
public static final String BACKFILL_PERPLEXITY
public int run(org.apache.hadoop.conf.Configuration conf, org.apache.hadoop.fs.Path inputPath, org.apache.hadoop.fs.Path topicModelOutputPath, int numTopics, int numTerms, double alpha, double eta, int maxIterations, int iterationBlockSize, double convergenceDelta, org.apache.hadoop.fs.Path dictionaryPath, org.apache.hadoop.fs.Path docTopicOutputPath, org.apache.hadoop.fs.Path topicModelStateTempPath, long randomSeed, float testFraction, int numTrainThreads, int numUpdateThreads, int maxItersPerDoc, int numReduceTasks, boolean backfillPerplexity) throws ClassNotFoundException, IOException, InterruptedException
public static double readPerplexity(org.apache.hadoop.conf.Configuration conf, org.apache.hadoop.fs.Path topicModelStateTemp, int iteration) throws IOException
Parameters:
topicModelStateTemp -
iteration -
Returns:
double[2] where first value is perplexity and second is model weight of those documents sampled during perplexity computation, or null if no perplexity data exists for the given iteration.
Throws:
IOException
public static org.apache.hadoop.fs.Path modelPath(org.apache.hadoop.fs.Path topicModelStateTempPath, int iterationNumber)
public static org.apache.hadoop.fs.Path perplexityPath(org.apache.hadoop.fs.Path topicModelStateTempPath, int iterationNumber)
public void runIteration(org.apache.hadoop.conf.Configuration conf, org.apache.hadoop.fs.Path corpusInput, org.apache.hadoop.fs.Path modelInput, org.apache.hadoop.fs.Path modelOutput, int iterationNumber, int maxIterations, int numReduceTasks) throws IOException, ClassNotFoundException, InterruptedException
public static org.apache.hadoop.fs.Path[] getModelPaths(org.apache.hadoop.conf.Configuration conf)
Copyright © 2008–2017 The Apache Software Foundation. All rights reserved.