public final class DocumentProcessor extends Object
This class converts a set of input documents in the SequenceFile format into StringTuples. The SequenceFile input should have a Text key containing the unique document identifier and a Text value containing the whole document. The document should be stored in UTF-8 encoding, which is recognizable by Hadoop. It uses the given Analyzer to process the document into Tokens.

Modifier and Type | Field and Description |
---|---|
static String | ANALYZER_CLASS |
static String | TOKENIZED_DOCUMENT_OUTPUT_FOLDER |
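The expected input layout described above can be produced with Hadoop's SequenceFile writer. A minimal sketch, assuming Hadoop 2.x on the classpath; the path `documents/part-00000` and the document IDs are hypothetical choices for this example:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class WriteDocuments {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hypothetical location; DocumentProcessor reads a directory of such files.
    Path file = new Path("documents/part-00000");
    try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
        SequenceFile.Writer.file(file),
        SequenceFile.Writer.keyClass(Text.class),
        SequenceFile.Writer.valueClass(Text.class))) {
      // Key: unique document identifier. Value: the whole document text.
      // Text stores its contents as UTF-8, matching what Hadoop expects.
      writer.append(new Text("doc-1"), new Text("The quick brown fox."));
      writer.append(new Text("doc-2"), new Text("Jumped over the lazy dog."));
    }
  }
}
```

Any `SequenceFile<Text, Text>` produced this way satisfies the key/value contract stated in the class description.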
Modifier and Type | Method and Description |
---|---|
static void | tokenizeDocuments(org.apache.hadoop.fs.Path input, Class<? extends org.apache.lucene.analysis.Analyzer> analyzerClass, org.apache.hadoop.fs.Path output, org.apache.hadoop.conf.Configuration baseConf): converts the input documents into token arrays using StringTuple. The input documents have to be in the SequenceFile format. |
public static final String TOKENIZED_DOCUMENT_OUTPUT_FOLDER
public static final String ANALYZER_CLASS
public static void tokenizeDocuments(org.apache.hadoop.fs.Path input,
                                     Class<? extends org.apache.lucene.analysis.Analyzer> analyzerClass,
                                     org.apache.hadoop.fs.Path output,
                                     org.apache.hadoop.conf.Configuration baseConf)
                              throws IOException, InterruptedException, ClassNotFoundException

Converts the input documents into token arrays using StringTuple. The input documents have to be in the SequenceFile format.

Parameters:
input - input directory of the documents in SequenceFile format
output - output directory where the StringTuple token array of each document is to be created
analyzerClass - the Lucene Analyzer for tokenizing the UTF-8 text

Throws:
IOException
InterruptedException
ClassNotFoundException
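The method above can be invoked as in the following sketch. It assumes the class lives in Mahout's `org.apache.mahout.vectorizer` package and that Lucene's `StandardAnalyzer` is an acceptable analyzer; the `documents` and `tokenized` directory names are hypothetical:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.mahout.vectorizer.DocumentProcessor;

public class TokenizeExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Input: directory of SequenceFile<Text, Text> documents (hypothetical path).
    Path input = new Path("documents");
    // Output: directory that will receive the StringTuple token arrays.
    Path output = new Path("tokenized");
    // Runs the tokenization job with the given Lucene analyzer.
    DocumentProcessor.tokenizeDocuments(input, StandardAnalyzer.class, output, conf);
  }
}
```

Note that the analyzer is passed as a Class, not an instance: the job instantiates it on each worker, so it must have a constructor the framework can invoke.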
Copyright © 2008–2017 The Apache Software Foundation. All rights reserved.