public final class DocumentProcessor extends Object
This class converts a set of input documents in the SequenceFile format into StringTuples. The SequenceFile input should have a Text key containing the unique document identifier and a Text value containing the whole document. The document should be stored in UTF-8 encoding, which is recognizable by Hadoop. It uses the given Analyzer to process the document into Tokens.

Modifier and Type | Field and Description |
---|---|
static String | ANALYZER_CLASS |
static String | TOKENIZED_DOCUMENT_OUTPUT_FOLDER |
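The expected input layout described above can be produced with Hadoop's SequenceFile writer. A minimal sketch, assuming Hadoop 2.x on the classpath; the path `documents/part-00000` and the document IDs are hypothetical choices for this example:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class WriteDocuments {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hypothetical location; DocumentProcessor reads a directory of such files.
    Path file = new Path("documents/part-00000");
    try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
        SequenceFile.Writer.file(file),
        SequenceFile.Writer.keyClass(Text.class),
        SequenceFile.Writer.valueClass(Text.class))) {
      // Key: unique document identifier. Value: the whole document text.
      // Text stores its contents as UTF-8, matching what Hadoop expects.
      writer.append(new Text("doc-1"), new Text("The quick brown fox."));
      writer.append(new Text("doc-2"), new Text("Jumped over the lazy dog."));
    }
  }
}
```

Any `SequenceFile<Text, Text>` produced this way satisfies the key/value contract stated in the class description.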
Modifier and Type | Method and Description |
---|---|
static void | tokenizeDocuments(org.apache.hadoop.fs.Path input, Class<? extends org.apache.lucene.analysis.Analyzer> analyzerClass, org.apache.hadoop.fs.Path output, org.apache.hadoop.conf.Configuration baseConf): converts the input documents into token arrays using StringTuple. The input documents have to be in the SequenceFile format. |
public static final String TOKENIZED_DOCUMENT_OUTPUT_FOLDER
public static final String ANALYZER_CLASS
public static void tokenizeDocuments(org.apache.hadoop.fs.Path input,
                                     Class<? extends org.apache.lucene.analysis.Analyzer> analyzerClass,
                                     org.apache.hadoop.fs.Path output,
                                     org.apache.hadoop.conf.Configuration baseConf)
                              throws IOException, InterruptedException, ClassNotFoundException

Converts the input documents into token arrays using StringTuple. The input documents have to be in the SequenceFile format.

Parameters:
input - input directory of the documents in SequenceFile format
output - output directory where the StringTuple token array of each document is to be created
analyzerClass - the Lucene Analyzer for tokenizing the UTF-8 text

Throws:
IOException
InterruptedException
ClassNotFoundException
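The method above can be invoked as in the following sketch. It assumes the class lives in Mahout's `org.apache.mahout.vectorizer` package and that Lucene's `StandardAnalyzer` is an acceptable analyzer; the `documents` and `tokenized` directory names are hypothetical:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.mahout.vectorizer.DocumentProcessor;

public class TokenizeExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Input: directory of SequenceFile<Text, Text> documents (hypothetical path).
    Path input = new Path("documents");
    // Output: directory that will receive the StringTuple token arrays.
    Path output = new Path("tokenized");
    // Runs the tokenization job with the given Lucene analyzer.
    DocumentProcessor.tokenizeDocuments(input, StandardAnalyzer.class, output, conf);
  }
}
```

Note that the analyzer is passed as a Class, not an instance: the job instantiates it on each worker, so it must have a constructor the framework can invoke.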
Copyright © 2008–2017 The Apache Software Foundation. All rights reserved.