public class ModelTrainer extends Object
Takes the "read-only" TopicModel and uses it to iteratively learn the p(topic|term, doc)
distribution for documents (this can be done in parallel across many documents, as the
"read-only" model is, well, read-only). The outputs of this step are then "reduced" onto the
"write" model, and these updates are not parallelizable in the same way: updates from different
documents can't be applied to the same entries in different threads at the same time, but
updates across many topics to the same term from the same document can be done in parallel,
so they are.
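
The read/write split described above can be sketched as follows. This is a minimal illustration, not Mahout's implementation: `ReadWriteSketch`, `infer`, `update`, and `trainOnePass` are hypothetical names, the models are plain `double[topic][term]` arrays rather than `TopicModel` objects, and the scoring rule (document-topic weight times read-model count, normalized over topics) is a simplifying assumption.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

/** Hypothetical, simplified stand-in for the read-model / write-model pattern. */
final class ReadWriteSketch {

  /** "Map" step: estimate p(topic | term, doc) for one document using only the read model. */
  static double[][] infer(double[][] readModel, double[] docTopics, double[] doc) {
    int numTopics = readModel.length;
    double[][] posterior = new double[numTopics][doc.length];
    for (int term = 0; term < doc.length; term++) {
      if (doc[term] == 0) {
        continue; // term does not occur in this document
      }
      double norm = 0;
      for (int k = 0; k < numTopics; k++) {
        // Assumed scoring rule: document-topic weight times read-model count.
        posterior[k][term] = docTopics[k] * readModel[k][term];
        norm += posterior[k][term];
      }
      for (int k = 0; k < numTopics; k++) {
        posterior[k][term] = norm > 0 ? posterior[k][term] / norm : 0;
      }
    }
    return posterior;
  }

  /** "Reduce" step for one document: each topic owns its own row, so topics run in parallel. */
  static void update(double[][] writeModel, double[] doc, double[][] posterior) {
    IntStream.range(0, writeModel.length).parallel().forEach(k -> {
      for (int term = 0; term < doc.length; term++) {
        writeModel[k][term] += doc[term] * posterior[k][term];
      }
    });
  }

  /** One pass: parallel inference across documents, then one document at a time onto the write model. */
  static double[][] trainOnePass(double[][] readModel, List<double[]> docs, List<double[]> docTopics) {
    // The read model is never modified here, so documents can be processed concurrently.
    List<double[][]> posteriors = IntStream.range(0, docs.size()).parallel()
        .mapToObj(i -> infer(readModel, docTopics.get(i), docs.get(i)))
        .collect(Collectors.toList());
    double[][] writeModel = new double[readModel.length][readModel[0].length];
    // Documents are applied sequentially so two documents never touch the same entry at once.
    for (int i = 0; i < docs.size(); i++) {
      update(writeModel, docs.get(i), posteriors.get(i));
    }
    return writeModel;
  }

  public static void main(String[] args) {
    double[][] readModel = {{1, 1}, {1, 1}}; // 2 topics x 2 terms
    List<double[]> docs = Arrays.asList(new double[] {2, 0}, new double[] {0, 4});
    List<double[]> docTopics = Arrays.asList(new double[] {0.5, 0.5}, new double[] {1.0, 0.0});
    System.out.println(Arrays.deepToString(trainOnePass(readModel, docs, docTopics)));
  }
}
```

In this sketch the write model starts empty and receives the full term mass of every document, split across topics according to the inferred distribution. Applying documents one at a time is the simplest way to honor the constraint that two documents never update the same entries concurrently; a real trainer may schedule this differently.
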
Because computation is done asynchronously, when iteration is done, it's important to call
the stop() method, which blocks until work is complete.
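
The start/train/stop lifecycle can be illustrated with a plain thread pool. The sketch below is an assumption about the general pattern, not the actual ModelTrainer code: `AsyncTrainerSketch` and its members are hypothetical, and the per-document "work" is a placeholder counter. It shows why results are only safe to read after the blocking `stop()` call returns.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical trainer showing an asynchronous start/train/stop lifecycle. */
final class AsyncTrainerSketch {
  private final int numThreads;
  private final AtomicInteger processed = new AtomicInteger();
  private ExecutorService pool;

  AsyncTrainerSketch(int numThreads) {
    this.numThreads = numThreads;
  }

  /** Creates the worker pool; must be called before train(). */
  void start() {
    pool = Executors.newFixedThreadPool(numThreads);
  }

  /** Queues work for one document and returns immediately. */
  void train(int docId) {
    pool.execute(() -> {
      // Placeholder for per-document inference and model updates.
      processed.incrementAndGet();
    });
  }

  /** Stops accepting work and blocks until every queued document has been processed. */
  void stop() {
    pool.shutdown();
    try {
      if (!pool.awaitTermination(60, TimeUnit.SECONDS)) {
        throw new IllegalStateException("training did not finish in time");
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while waiting for training", e);
    }
  }

  int processedCount() {
    return processed.get();
  }

  /** Runs the full lifecycle and returns how many documents were processed. */
  static int runDemo(int numDocs) {
    AsyncTrainerSketch trainer = new AsyncTrainerSketch(4);
    trainer.start();
    for (int docId = 0; docId < numDocs; docId++) {
      trainer.train(docId);
    }
    trainer.stop(); // without this call, some documents may still be in flight
    return trainer.processedCount();
  }

  public static void main(String[] args) {
    System.out.println(runDemo(100));
  }
}
```

Reading `processedCount()` before `stop()` returns could observe a partial count, which mirrors the reason the class documentation asks callers to invoke `stop()` once iteration is done.
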
Setting the read model and the write model to be the same object may not quite work yet,
on account of parallelism badness.

Constructor and Description |
---|
ModelTrainer(TopicModel model, int numTrainThreads, int numTopics, int numTerms) WARNING: this constructor may not lead to good behavior. |
ModelTrainer(TopicModel initialReadModel, TopicModel initialWriteModel, int numTrainThreads, int numTopics, int numTerms) |

Modifier and Type | Method and Description |
---|---|
void | batchTrain(Map<Vector,Vector> batch, boolean update, int numDocTopicsIters) |
double | calculatePerplexity(VectorIterable matrix, VectorIterable docTopicCounts) |
double | calculatePerplexity(VectorIterable matrix, VectorIterable docTopicCounts, double testFraction) |
double | calculatePerplexity(Vector document, Vector docTopicCounts, int numDocTopicIters) |
TopicModel | getReadModel() |
void | persist(org.apache.hadoop.fs.Path outputPath) |
void | start() |
void | stop() |
void | train(VectorIterable matrix, VectorIterable docTopicCounts) |
void | train(VectorIterable matrix, VectorIterable docTopicCounts, int numDocTopicIters) |
void | train(Vector document, Vector docTopicCounts, boolean update, int numDocTopicIters) |
void | trainSync(Vector document, Vector docTopicCounts, boolean update, int numDocTopicIters) |

public ModelTrainer(TopicModel initialReadModel, TopicModel initialWriteModel, int numTrainThreads, int numTopics, int numTerms)
public ModelTrainer(TopicModel model, int numTrainThreads, int numTopics, int numTerms)
model - to be used for both reading (inference) and accumulating (learning)
numTrainThreads -
numTopics -
numTerms -
public TopicModel getReadModel()
public void start()
public void train(VectorIterable matrix, VectorIterable docTopicCounts)
public double calculatePerplexity(VectorIterable matrix, VectorIterable docTopicCounts)
public double calculatePerplexity(VectorIterable matrix, VectorIterable docTopicCounts, double testFraction)
public void train(VectorIterable matrix, VectorIterable docTopicCounts, int numDocTopicIters)
public void train(Vector document, Vector docTopicCounts, boolean update, int numDocTopicIters)
public void trainSync(Vector document, Vector docTopicCounts, boolean update, int numDocTopicIters)
public double calculatePerplexity(Vector document, Vector docTopicCounts, int numDocTopicIters)
public void stop()
public void persist(org.apache.hadoop.fs.Path outputPath) throws IOException
IOException
Copyright © 2008–2017 The Apache Software Foundation. All rights reserved.