public final class DictionaryVectorizer extends AbstractJob implements Vectorizer

This class converts a set of input documents into term-frequency vectors. The input SequenceFile should have a `Text` key containing the unique document identifier and a `StringTuple` value containing the tokenized document. You may use `DocumentProcessor` to tokenize the document.
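The expected input shape (a unique document id mapped to its token list) can be sketched with plain Java collections. This is a stand-in only: the real `DocumentProcessor` runs a Lucene Analyzer over each document and writes a `SequenceFile<Text, StringTuple>`; the whitespace tokenizer here is an assumption used purely to show the key/value shape.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TokenizedInputDemo {
    /** Naive whitespace tokenizer standing in for DocumentProcessor,
     *  which would normally apply a Lucene Analyzer. */
    static List<String> tokenize(String text) {
        return Arrays.asList(text.split("\\s+"));
    }

    public static void main(String[] args) {
        // Models a SequenceFile<Text, StringTuple>: doc id -> tokenized document
        Map<String, List<String>> corpus = new LinkedHashMap<>();
        corpus.put("doc1", tokenize("the quick brown fox"));
        corpus.put("doc2", tokenize("the lazy dog"));
        System.out.println(corpus.get("doc1")); // [the, quick, brown, fox]
    }
}
```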
This is a dictionary-based Vectorizer.

| Modifier and Type | Field and Description |
|---|---|
| `static int` | `DEFAULT_MIN_SUPPORT` |
| `static String` | `DICTIONARY_FILE` |
| `static String` | `DOCUMENT_VECTOR_OUTPUT_FOLDER` |
| `static String` | `MAX_NGRAMS` |
| `static String` | `MIN_SUPPORT` |
Fields inherited from class AbstractJob: `argMap, inputFile, inputPath, outputFile, outputPath, tempPath`
| Modifier and Type | Method and Description |
|---|---|
| `static void` | `createTermFrequencyVectors(org.apache.hadoop.fs.Path input, org.apache.hadoop.fs.Path output, String tfVectorsFolderName, org.apache.hadoop.conf.Configuration baseConf, int minSupport, int maxNGramSize, float minLLRValue, float normPower, boolean logNormalize, int numReducers, int chunkSizeInMegabytes, boolean sequentialAccess, boolean namedVectors)` Create Term Frequency (Tf) vectors from the input set of documents in SequenceFile format. |
| `void` | `createVectors(org.apache.hadoop.fs.Path input, org.apache.hadoop.fs.Path output, VectorizerConfig config)` |
| `static void` | `main(String[] args)` |
| `int` | `run(String[] args)` |
Methods inherited from class AbstractJob: `addFlag, addInputOption, addOption, addOption, addOption, addOption, addOutputOption, buildOption, buildOption, getAnalyzerClassFromOption, getCLIOption, getConf, getDimensions, getFloat, getFloat, getGroup, getInputFile, getInputPath, getInt, getInt, getOption, getOption, getOption, getOptions, getOutputFile, getOutputPath, getOutputPath, getTempPath, getTempPath, hasOption, keyFor, maybePut, parseArguments, parseArguments, parseDirectories, prepareJob, prepareJob, prepareJob, prepareJob, setConf, setS3SafeCombinedInputPath, shouldRunNextPhase`
Field detail:

- `public static final String DOCUMENT_VECTOR_OUTPUT_FOLDER`
- `public static final String MIN_SUPPORT`
- `public static final String MAX_NGRAMS`
- `public static final int DEFAULT_MIN_SUPPORT`
- `public static final String DICTIONARY_FILE`
`public void createVectors(org.apache.hadoop.fs.Path input, org.apache.hadoop.fs.Path output, VectorizerConfig config) throws IOException, ClassNotFoundException, InterruptedException`

Specified by: `createVectors` in interface `Vectorizer`

Throws: `IOException`, `ClassNotFoundException`, `InterruptedException`
`public static void createTermFrequencyVectors(org.apache.hadoop.fs.Path input, org.apache.hadoop.fs.Path output, String tfVectorsFolderName, org.apache.hadoop.conf.Configuration baseConf, int minSupport, int maxNGramSize, float minLLRValue, float normPower, boolean logNormalize, int numReducers, int chunkSizeInMegabytes, boolean sequentialAccess, boolean namedVectors) throws IOException, InterruptedException, ClassNotFoundException`

Create Term Frequency (Tf) vectors from the input set of documents in SequenceFile format. This method tries to bound the maximum memory used by the feature chunk per node, splitting the process across multiple map/reduce jobs.

Parameters:

- `input` - input directory of the documents in SequenceFile format
- `output` - output directory where the `RandomAccessSparseVector`s of the documents are generated
- `tfVectorsFolderName` - the name of the folder in which the final output vectors will be stored
- `baseConf` - job configuration
- `minSupport` - the minimum frequency a feature must have in the entire corpus to be considered for inclusion in the sparse vector
- `maxNGramSize` - 1 = unigrams; 2 = unigrams and bigrams; 3 = unigrams, bigrams, and trigrams
- `minLLRValue` - minimum log-likelihood ratio value used to prune n-grams
- `normPower` - the L_p norm to be computed
- `logNormalize` - whether to use log normalization
- `numReducers` -
- `chunkSizeInMegabytes` - the size in MB of the feature => id chunk to be kept in memory at each node during the Map/Reduce stage. It is recommended you calculate this based on the number of cores and the free memory available per node. Say you have 2 cores and around 1 GB of memory to spare; we recommend a split size of around 400-500 MB, so that two simultaneous reducers can create partial vectors without thrashing the system through increased swapping
- `sequentialAccess` -
- `namedVectors` -

Throws: `IOException`, `InterruptedException`, `ClassNotFoundException`
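The sizing guidance for `chunkSizeInMegabytes` above amounts to dividing the spare memory on a node across the reducers that run simultaneously (roughly one per core). A minimal sketch of that arithmetic follows; the helper name and the 10% headroom factor are illustrative assumptions, not part of the Mahout API.

```java
public class ChunkSizeEstimate {
    /** Splits free memory per node across simultaneous reducers (one per core),
     *  keeping ~10% headroom so partial-vector creation does not push the node
     *  into swap. The headroom factor is an assumption, not a Mahout default. */
    static int recommendedChunkSizeMb(int freeMemoryMb, int cores) {
        return (int) (freeMemoryMb / cores * 0.9);
    }

    public static void main(String[] args) {
        // 2 cores and ~1 GB to spare lands in the 400-500 MB range the docs suggest
        System.out.println(recommendedChunkSizeMb(1024, 2)); // 460
    }
}
```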
`public int run(String[] args) throws Exception`

Specified by: `run` in interface `org.apache.hadoop.util.Tool`

Throws: `Exception`
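The `maxNGramSize` semantics documented for `createTermFrequencyVectors` (1 = unigrams only, 2 = adds bigrams, 3 = adds trigrams) can be illustrated with a self-contained enumeration sketch. Note this only enumerates candidate n-grams; in the actual job, n-grams beyond unigrams are additionally pruned by the `minLLRValue` log-likelihood ratio threshold.

```java
import java.util.ArrayList;
import java.util.List;

public class NGramDemo {
    /** Enumerates all n-grams from size 1 up to maxNGramSize, mirroring the
     *  documented semantics: 1 = unigrams; 2 = + bigrams; 3 = + trigrams. */
    static List<String> ngrams(List<String> tokens, int maxNGramSize) {
        List<String> out = new ArrayList<>();
        for (int n = 1; n <= maxNGramSize; n++) {
            for (int i = 0; i + n <= tokens.size(); i++) {
                out.add(String.join(" ", tokens.subList(i, i + n)));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(ngrams(List.of("the", "quick", "fox"), 2));
        // [the, quick, fox, the quick, quick fox]
    }
}
```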
Copyright © 2008–2017 The Apache Software Foundation. All rights reserved.