Intro

Most ML algorithms need to represent multidimensional data concisely and to perform common operations on that data easily. MAHOUT-6 introduced Vector and Matrix datatypes of arbitrary cardinality, along with a set of common operations on their instances. Vectors and matrices come in sparse and dense implementations that are memory resident and are suitable for manipulating intermediate results within mapper, combiner and reducer implementations. They are not intended for applications requiring vectors or matrices that exceed the memory of a single JVM, though such applications might be able to use them within a larger organizing framework.

Background

See http://mail-archives.apache.org/mod_mbox/lucene-mahout-dev/200802.mbox/browser

Vectors

Mahout supports a Vector interface that defines the following operations over all implementation classes: assign, cardinality, copy, divide, dot, get, haveSharedCells, like, minus, normalize, plus, set, size, times, toArray, viewPart, zSum and cross. The class DenseVector implements vectors as a double[] that is storage and access efficient. The class SparseVector implements vectors as a HashMap<Integer, Double> that is surprisingly fast and efficient. For sparse vectors, the size() method returns the number of elements currently stored, whereas the cardinality() method returns the number of dimensions of the vector. An additional VectorView class allows views of an underlying vector to be specified by the viewPart() method. See the JavaDocs for more complete definitions.
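The difference between size() and cardinality() on a HashMap-backed sparse vector can be illustrated with a minimal sketch. The class SparseVectorSketch below is a hypothetical stand-in written for this page, not Mahout's SparseVector, and its choice to drop an entry when a zero is assigned is an assumption of the sketch.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a HashMap-backed sparse vector; not Mahout's SparseVector.
class SparseVectorSketch {
    private final int cardinality;                                // number of dimensions
    private final Map<Integer, Double> values = new HashMap<>();  // stored (non-zero) elements only

    SparseVectorSketch(int cardinality) {
        this.cardinality = cardinality;
    }

    // Number of dimensions, fixed at construction.
    int cardinality() {
        return cardinality;
    }

    // Number of elements actually stored.
    int size() {
        return values.size();
    }

    double get(int index) {
        checkIndex(index);
        return values.getOrDefault(index, 0.0);
    }

    void set(int index, double value) {
        checkIndex(index);
        if (value == 0.0) {
            values.remove(index);   // assumption of this sketch: zeros are not stored
        } else {
            values.put(index, value);
        }
    }

    // Dot product that only visits the entries of the smaller map.
    double dot(SparseVectorSketch other) {
        if (other.cardinality != cardinality) {
            throw new IllegalArgumentException("cardinality mismatch");
        }
        Map<Integer, Double> small = values.size() <= other.values.size() ? values : other.values;
        Map<Integer, Double> large = small == values ? other.values : values;
        double sum = 0.0;
        for (Map.Entry<Integer, Double> e : small.entrySet()) {
            Double match = large.get(e.getKey());
            if (match != null) {
                sum += e.getValue() * match;
            }
        }
        return sum;
    }

    private void checkIndex(int index) {
        if (index < 0 || index >= cardinality) {
            throw new IndexOutOfBoundsException("index " + index);
        }
    }
}
```

A vector created with new SparseVectorSketch(1000) and two assigned elements reports a cardinality of 1000 but a size of 2, and dot() touches at most two entries.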

Matrices

Mahout also supports a Matrix interface that defines a similar set of operations over all implementation classes: assign, assignColumn, assignRow, cardinality, copy, divide, get, haveSharedCells, like, minus, plus, set, size, times, transpose, toArray, viewPart and zSum. The class DenseMatrix implements matrices as a double[][] that is storage and access efficient. The class SparseRowMatrix implements matrices as a Vector[] holding the rows of the matrix, each in a SparseVector, and the symmetric class SparseColumnMatrix implements matrices as a Vector[] holding the columns, each in a SparseVector. Each of these classes can quickly produce a given row or column, respectively. A fourth class, SparseMatrix, uses a HashMap<Integer, Vector> whose values are also SparseVector instances. For sparse matrices, the size() method returns an int[2] containing the actual row and column sizes, whereas the cardinality() method returns an int[2] with the number of dimensions of each. An additional MatrixView class allows views of an underlying matrix to be specified by the viewPart() method. See the JavaDocs for more complete definitions.
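To show why a row-oriented sparse layout produces rows quickly but columns slowly, here is a minimal sketch. SparseRowMatrixSketch and its methods getRow and getColumn are hypothetical names chosen for illustration. The sketch stores each row as a plain map rather than a Mahout Vector, so it is not the library's SparseRowMatrix.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical row-oriented sparse matrix; an illustration, not Mahout's SparseRowMatrix.
class SparseRowMatrixSketch {
    private final int numRows;
    private final int numCols;
    private final List<Map<Integer, Double>> rows = new ArrayList<>();

    SparseRowMatrixSketch(int numRows, int numCols) {
        this.numRows = numRows;
        this.numCols = numCols;
        for (int i = 0; i < numRows; i++) {
            rows.add(new HashMap<>());   // one sparse map per row
        }
    }

    // Declared dimensions as {rows, columns}.
    int[] cardinality() {
        return new int[] {numRows, numCols};
    }

    void set(int row, int col, double value) {
        checkColumn(col);
        if (value == 0.0) {
            rows.get(row).remove(col);   // assumption of this sketch: zeros are not stored
        } else {
            rows.get(row).put(col, value);
        }
    }

    double get(int row, int col) {
        checkColumn(col);
        return rows.get(row).getOrDefault(col, 0.0);
    }

    // Row access: a single list lookup, then a copy of that row's stored entries.
    Map<Integer, Double> getRow(int row) {
        return new HashMap<>(rows.get(row));
    }

    // Column access: every row has to be probed for the requested column.
    Map<Integer, Double> getColumn(int col) {
        checkColumn(col);
        Map<Integer, Double> column = new HashMap<>();
        for (int r = 0; r < numRows; r++) {
            Double value = rows.get(r).get(col);
            if (value != null) {
                column.put(r, value);
            }
        }
        return column;
    }

    private void checkColumn(int col) {
        if (col < 0 || col >= numCols) {
            throw new IndexOutOfBoundsException("column " + col);
        }
    }
}
```

A column-oriented layout would simply swap the roles of the two access methods, which is the symmetry the text describes between SparseRowMatrix and SparseColumnMatrix.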

The Matrix interface does not currently provide invert or determinant methods, though these are desirable. It is arguable that the implementations of SparseRowMatrix and SparseColumnMatrix ought to use the HashMap<Integer, Vector> implementations and that SparseMatrix should instead use a HashMap<Integer, HashMap<Integer, Double>>. Other forms of sparse matrices can also be envisioned that support different storage and access characteristics. Because the arguments of assignColumn and assignRow operations accept all forms of Vector, it is possible to construct instances of sparse matrices containing dense rows or columns. See the JavaDocs for more complete definitions.
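The nested-map storage suggested above might look roughly like the following sketch, in which a row map is only allocated once the row receives a non-zero cell. NestedMapMatrixSketch, storedRows and the array-based times method are hypothetical names introduced here for illustration. Nothing in this block is taken from Mahout's code.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical matrix stored as a map of rows, each row a map of column -> value.
class NestedMapMatrixSketch {
    private final int numRows;
    private final int numCols;
    private final Map<Integer, Map<Integer, Double>> cells = new HashMap<>();

    NestedMapMatrixSketch(int numRows, int numCols) {
        this.numRows = numRows;
        this.numCols = numCols;
    }

    void set(int row, int col, double value) {
        check(row, col);
        if (value == 0.0) {
            // Assumption of this sketch: zeros are removed, and empty rows are dropped.
            Map<Integer, Double> existing = cells.get(row);
            if (existing != null) {
                existing.remove(col);
                if (existing.isEmpty()) {
                    cells.remove(row);
                }
            }
        } else {
            cells.computeIfAbsent(row, k -> new HashMap<>()).put(col, value);
        }
    }

    double get(int row, int col) {
        check(row, col);
        Map<Integer, Double> existing = cells.get(row);
        return existing == null ? 0.0 : existing.getOrDefault(col, 0.0);
    }

    // Number of rows that currently hold at least one stored cell.
    int storedRows() {
        return cells.size();
    }

    // Matrix-vector product that visits stored cells only.
    double[] times(double[] x) {
        if (x.length != numCols) {
            throw new IllegalArgumentException("expected a vector of length " + numCols);
        }
        double[] y = new double[numRows];
        for (Map.Entry<Integer, Map<Integer, Double>> rowEntry : cells.entrySet()) {
            double sum = 0.0;
            for (Map.Entry<Integer, Double> cell : rowEntry.getValue().entrySet()) {
                sum += cell.getValue() * x[cell.getKey()];
            }
            y[rowEntry.getKey()] = sum;
        }
        return y;
    }

    private void check(int row, int col) {
        if (row < 0 || row >= numRows || col < 0 || col >= numCols) {
            throw new IndexOutOfBoundsException("cell (" + row + ", " + col + ")");
        }
    }
}
```

The trade-off in this layout is that memory scales with the number of stored cells rather than with the number of rows, at the cost of an extra hash lookup per access.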

For applications like PageRank/TextRank, iterative approaches to calculating eigenvectors would also be useful. Batching of row/column operations would also be useful, for example by having assignRow or assignColumn accept UnaryFunction and BinaryFunction arguments.
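One common iterative approach of the kind mentioned above is power iteration, sketched below on a plain double[][] so that it does not depend on any Mahout class. PowerIterationSketch and its method names are hypothetical. The sketch assumes a square matrix whose dominant eigenvalue is positive and strictly larger in magnitude than the others, as is typical for the non-negative matrices used in PageRank-style ranking.

```java
import java.util.Arrays;

// Hypothetical helper illustrating power iteration on a dense square matrix.
class PowerIterationSketch {

    // Returns a unit-length estimate of the dominant eigenvector of a.
    static double[] dominantEigenvector(double[][] a, int maxIterations, double tolerance) {
        int n = a.length;
        double[] v = new double[n];
        Arrays.fill(v, 1.0 / Math.sqrt(n));   // uniform unit-length starting vector

        for (int iter = 0; iter < maxIterations; iter++) {
            double[] next = multiply(a, v);
            double norm = norm2(next);
            if (norm == 0.0) {
                return next;                   // starting vector was mapped to zero
            }
            double change = 0.0;
            for (int i = 0; i < n; i++) {
                next[i] /= norm;               // renormalize to unit length
                change += Math.abs(next[i] - v[i]);
            }
            v = next;
            if (change < tolerance) {
                break;                         // successive estimates agree
            }
        }
        return v;
    }

    // Rayleigh quotient v^T A v for a unit-length v: an eigenvalue estimate.
    static double eigenvalueEstimate(double[][] a, double[] v) {
        double[] av = multiply(a, v);
        double sum = 0.0;
        for (int i = 0; i < v.length; i++) {
            sum += v[i] * av[i];
        }
        return sum;
    }

    private static double[] multiply(double[][] a, double[] v) {
        double[] result = new double[a.length];
        for (int i = 0; i < a.length; i++) {
            double sum = 0.0;
            for (int j = 0; j < v.length; j++) {
                sum += a[i][j] * v[j];
            }
            result[i] = sum;
        }
        return result;
    }

    private static double norm2(double[] v) {
        double sum = 0.0;
        for (double x : v) {
            sum += x * x;
        }
        return Math.sqrt(sum);
    }
}
```

Each iteration needs only a matrix-vector product, which is why the method suits sparse matrices: the multiply step could be replaced by a sparse times operation without changing the rest of the loop.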

Ideas

As Vector and Matrix implementations are currently memory-resident, very large instances greater than available memory are not supported. An extended set of implementations that use HBase (BigTable) in Hadoop to represent their instances would facilitate applications requiring such large collections.
See MAHOUT-6 and Hama.

References

Have a look at older parallel computing libraries such as ScaLAPACK, and others.