public final class TFIDFConverter extends Object
This class converts a set of input term-frequency vectors to TF-IDF vectors. It expects a SequenceFile input with a WritableComparable key and a VectorWritable value containing the term frequency vector. This conversion class uses multiple map/reduces to convert the vectors to TF-IDF format.

Modifier and Type | Field and Description
---|---
static String | FEATURE_COUNT
static String | FREQUENCY_FILE
static String | MAX_DF
static String | MIN_DF
static String | VECTOR_COUNT
static String | WORDCOUNT_OUTPUT_FOLDER
Modifier and Type | Method and Description
---|---
static Pair<Long[],List<org.apache.hadoop.fs.Path>> | calculateDF(org.apache.hadoop.fs.Path input, org.apache.hadoop.fs.Path output, org.apache.hadoop.conf.Configuration baseConf, int chunkSizeInMegabytes). Calculates the document frequencies of all terms from the input set of vectors in SequenceFile format.
static void | processTfIdf(org.apache.hadoop.fs.Path input, org.apache.hadoop.fs.Path output, org.apache.hadoop.conf.Configuration baseConf, Pair<Long[],List<org.apache.hadoop.fs.Path>> datasetFeatures, int minDf, long maxDF, float normPower, boolean logNormalize, boolean sequentialAccessOutput, boolean namedVector, int numReducers). Creates Term Frequency-Inverse Document Frequency (TF-IDF) vectors from the input set of vectors in SequenceFile format.
public static final String VECTOR_COUNT
public static final String FEATURE_COUNT
public static final String MIN_DF
public static final String MAX_DF
public static final String FREQUENCY_FILE
public static final String WORDCOUNT_OUTPUT_FOLDER
public static void processTfIdf(org.apache.hadoop.fs.Path input, org.apache.hadoop.fs.Path output, org.apache.hadoop.conf.Configuration baseConf, Pair<Long[],List<org.apache.hadoop.fs.Path>> datasetFeatures, int minDf, long maxDF, float normPower, boolean logNormalize, boolean sequentialAccessOutput, boolean namedVector, int numReducers) throws IOException, InterruptedException, ClassNotFoundException
Creates Term Frequency-Inverse Document Frequency (TF-IDF) vectors from the input set of vectors in SequenceFile format. This job uses a fixed limit on the maximum memory used by the feature chunk per node, thereby splitting the process across multiple map/reduces. Before using this method, calculateDF should be called.

Parameters:
input - input directory of the vectors in SequenceFile format
output - output directory where the RandomAccessSparseVectors of the documents are generated
datasetFeatures - document frequency information calculated by calculateDF
minDf - the minimum document frequency. Default 1
maxDF - the maximum document frequency, expressed as a percentage of the vector count. Can be used to remove very high frequency features. Expressed as an integer between 0 and 100. Default 99
numReducers - the number of reducers to spawn. This also affects the possible parallelism, since each reducer will typically produce a single output file containing the TF-IDF vectors for a subset of the documents in the corpus.

Throws:
IOException
InterruptedException
ClassNotFoundException
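For intuition about what processTfIdf computes per feature, the following is a minimal plain-Java sketch of the classic TF-IDF weight and of pruning by the minDf/maxDF thresholds. It is an illustration only: Mahout's actual job delegates the weighting to a Lucene-style similarity, so the exact smoothing may differ, and the class name here is hypothetical.

```java
// Hypothetical illustration class; not part of Mahout's API.
public class TfIdfSketch {
    // Classic TF-IDF weight: tf * log(numDocs / df). Mahout's job uses a
    // Lucene-style similarity, so its exact smoothing may differ.
    static double tfIdf(int tf, int df, int numDocs) {
        return tf * Math.log((double) numDocs / df);
    }

    // Pruning analogous to minDf / maxDF: drop features whose document
    // frequency is below minDf or above maxDF percent of the document count.
    static boolean keepFeature(int df, int numDocs, int minDf, int maxDfPercent) {
        return df >= minDf && df * 100L <= (long) maxDfPercent * numDocs;
    }

    public static void main(String[] args) {
        int numDocs = 100;
        System.out.println(tfIdf(3, 10, numDocs));            // mid-frequency term gets a positive weight
        System.out.println(keepFeature(1, numDocs, 2, 99));   // pruned: below minDf
        System.out.println(keepFeature(100, numDocs, 2, 99)); // pruned: above maxDF percent
        System.out.println(keepFeature(50, numDocs, 2, 99));  // kept
    }
}
```

A feature appearing in every document (df = numDocs) gets weight tf * log(1) = 0, which is why near-ubiquitous features are both useless and worth pruning with maxDF.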
public static Pair<Long[],List<org.apache.hadoop.fs.Path>> calculateDF(org.apache.hadoop.fs.Path input, org.apache.hadoop.fs.Path output, org.apache.hadoop.conf.Configuration baseConf, int chunkSizeInMegabytes) throws IOException, InterruptedException, ClassNotFoundException
Calculates the document frequencies of all terms from the input set of vectors in SequenceFile format. This job uses a fixed limit on the maximum memory used by the feature chunk per node, thereby splitting the process across multiple map/reduces.

Parameters:
input - input directory of the vectors in SequenceFile format
output - output directory where document frequencies will be stored
chunkSizeInMegabytes - the size in MB of the feature-to-id chunk to be kept in memory at each node during the map/reduce stage. It is recommended that you calculate this based on the number of cores and the free memory available per node. For example, if you have 2 cores and around 1 GB of memory to spare, a chunk size of around 400-500 MB lets two simultaneous reducers create partial vectors without thrashing the system through increased swapping.

Throws:
IOException
InterruptedException
ClassNotFoundException
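Conceptually, the document frequency of a term is the number of documents whose term-frequency vector has a nonzero entry for it. The following is a minimal in-memory sketch of that counting step, using plain Java maps in place of VectorWritable and the distributed chunking that calculateDF performs; the class name is hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; not part of Mahout's API.
public class DocFrequencySketch {
    // For each term, counts the number of documents whose term-frequency
    // vector contains a nonzero entry for that term. This is the in-memory
    // analogue of what calculateDF distributes across map/reduces.
    static Map<String, Integer> documentFrequencies(List<Map<String, Integer>> tfVectors) {
        Map<String, Integer> df = new HashMap<>();
        for (Map<String, Integer> doc : tfVectors) {
            for (Map.Entry<String, Integer> e : doc.entrySet()) {
                if (e.getValue() > 0) {
                    df.merge(e.getKey(), 1, Integer::sum);
                }
            }
        }
        return df;
    }

    public static void main(String[] args) {
        List<Map<String, Integer>> docs = List.of(
                Map.of("hadoop", 2, "vector", 1),
                Map.of("hadoop", 1),
                Map.of("vector", 3, "tfidf", 1));
        System.out.println(documentFrequencies(docs)); // hadoop and vector in 2 docs, tfidf in 1
    }
}
```

In the real job the feature-to-id dictionary is split into chunks of chunkSizeInMegabytes so that no single node has to hold the entire vocabulary in memory.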
Copyright © 2008–2017 The Apache Software Foundation. All rights reserved.