Scala & Spark Bindings:¶
Bringing algebraic semantics
What is Scala & Spark Bindings?¶
In short, Scala & Spark Bindings for Mahout is a Scala DSL and algebraic optimizer for expressions like this one (an actual formula from (d)spca):
\[\mathbf{G}=\mathbf{B}\mathbf{B}^{\top}-\mathbf{C}-\mathbf{C}^{\top}+\mathbf{s}_{q}\mathbf{s}_{q}^{\top}\boldsymbol{\xi}^{\top}\boldsymbol{\xi}\]
bound to in-core and distributed computations (currently, on Apache Spark).
Mahout Scala & Spark Bindings expression of the above:
val g = bt.t %*% bt - c - c.t + (s_q cross s_q) * (xi dot xi)
The main idea is that a scientist writing algebraic expressions should not have to care about distributed operation plans and can work entirely on the logical level, just as he or she would in R.
Another idea is decoupling the logical expression from the distributed backend. As more backends are added, this implies "write once, run everywhere".
The linear algebra side works with scalars, in-core vectors and matrices, and Mahout Distributed Row Matrices (DRMs).
The ecosystem of operators is built in R's image, i.e. it follows R naming such as %*%, colSums, nrow, length operating over vectors or matrices.
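To make the R-style naming concrete, here is a minimal, self-contained sketch in plain Scala of how names such as %*%, colSums and nrow can be layered onto ordinary arrays with an implicit class. The object RLikeSketch and everything inside it are hypothetical and only mimic the naming; this is not Mahout's implementation, and the real bindings operate on Mahout's own vector and matrix types.

```scala
object RLikeSketch {
  // Hypothetical stand-in for an in-core matrix: row-major, assumed rectangular.
  type Mat = Array[Array[Double]]

  implicit class MatOps(m: Mat) {
    // R-style dimension accessors.
    def nrow: Int = m.length
    def ncol: Int = if (m.isEmpty) 0 else m(0).length

    // Transpose, written as a postfix-style member.
    def t: Mat = Array.tabulate(ncol, nrow)((j, i) => m(i)(j))

    // Sum of every column, as in R's colSums.
    def colSums: Array[Double] = Array.tabulate(ncol)(j => m.map(_(j)).sum)

    // Matrix product. An identifier starting with '%' has the same
    // precedence as '*', so it binds tighter than '+' and '-'.
    def %*%(that: Mat): Mat = {
      require(ncol == that.nrow, "inner dimensions must agree")
      Array.tabulate(nrow, that.ncol) { (i, j) =>
        (0 until ncol).map(k => m(i)(k) * that(k)(j)).sum
      }
    }
  }
}
```

With `import RLikeSketch._` in scope, an expression such as `a %*% a.t` reads much like its R counterpart, which is the effect the operator naming is after.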
An important part of Spark Bindings is the expression optimizer. It looks at an expression as a whole and figures out how it can be simplified and which physical operators should be picked. For example, there are currently about 5 different physical operators performing DRM-DRM multiplication, picked based on matrix geometry, distributed dataset partitioning, orientation, etc. Counting DRM by in-core combinations adds another 4, i.e. 9 in total, all of it for just the simple x %*% y logical notation.
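The two stages described above (simplify the logical expression, then pick a physical operator from the operands' properties) can be sketched with a toy expression tree. All names here (Expr, Leaf, AtA, optimize, pickOperator, the operator labels and the SmallDim threshold) are hypothetical and chosen for illustration; they are not Mahout's optimizer classes or its actual selection rules.

```scala
object OptimizerSketch {
  // Logical plan nodes: what the user wrote.
  sealed trait Expr
  final case class Leaf(name: String, nrow: Long, ncol: Long, distributed: Boolean) extends Expr
  final case class Transpose(a: Expr) extends Expr
  final case class Times(a: Expr, b: Expr) extends Expr
  // Fused node that only the optimizer produces.
  final case class AtA(a: Expr) extends Expr

  // Stage 1: bottom-up algebraic rewrites.
  def optimize(e: Expr): Expr = e match {
    case Transpose(a) => optimize(a) match {
      case Transpose(inner) => inner           // (A^T)^T = A
      case oa               => Transpose(oa)
    }
    case Times(a, b) => (optimize(a), optimize(b)) match {
      case (Transpose(x), y) if x == y => AtA(x) // A^T A as a single operator
      case (oa, ob)                    => Times(oa, ob)
    }
    case AtA(a)     => AtA(optimize(a))
    case leaf: Leaf => leaf
  }

  def isDistributed(e: Expr): Boolean = e match {
    case Leaf(_, _, _, d) => d
    case Transpose(a)     => isDistributed(a)
    case AtA(a)           => isDistributed(a)
    case Times(a, b)      => isDistributed(a) || isDistributed(b)
  }

  def nrow(e: Expr): Long = e match {
    case Leaf(_, r, _, _) => r
    case Transpose(a)     => ncol(a)
    case AtA(a)           => ncol(a)
    case Times(a, _)      => nrow(a)
  }

  def ncol(e: Expr): Long = e match {
    case Leaf(_, _, c, _) => c
    case Transpose(a)     => nrow(a)
    case AtA(a)           => ncol(a)
    case Times(_, b)      => ncol(b)
  }

  // Arbitrary illustrative threshold for "the result fits in memory".
  val SmallDim = 1000L

  // Stage 2: choose a physical operator label from geometry and placement.
  def pickOperator(e: Expr): String = e match {
    case AtA(a) if ncol(a) <= SmallDim                        => "AtA-slim"
    case AtA(_)                                               => "AtA-wide"
    case Times(a, b) if isDistributed(a) && isDistributed(b)  => "DRM-times-DRM"
    case Times(a, b) if isDistributed(a) || isDistributed(b)  => "DRM-times-in-core"
    case _                                                    => "in-core"
  }
}
```

In this sketch the user only ever writes `Times` and `Transpose`; whether a product runs as a fused `AtA` or as one of the multiplication variants is decided after the whole expression has been inspected, which is the point of keeping the logical notation separate from the physical plan.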
Please refer to the documentation for details.
Status¶
This environment mostly addresses R-like linear algebra optimizations for Spark, Flink and H2O.
Documentation¶
Distributed methods and solvers using Bindings¶
 In-core (ssvd) and Distributed (dssvd) Stochastic SVD - guinea pigs - see the bindings manual
 In-core (spca) and Distributed (dspca) Stochastic PCA - guinea pigs - see the bindings manual
 Distributed thin QR decomposition (dqrThin) - guinea pig - see the bindings manual
 Current list of algorithms
Related history of note¶
 CLI and Driver for Spark version of item similarity - MAHOUT-1541
 Command line interface for generalizable Spark pipelines - MAHOUT-1569
 Cooccurrence Analysis / Item-based Recommendation - MAHOUT-1464
 Spark Bindings - MAHOUT-1346
 Scala Bindings - MAHOUT-1297
 Interactive Scala & Spark Bindings Shell & Script processor - MAHOUT-1489
 OLS tutorial using Mahout shell - MAHOUT-1542
 Full abstraction of DRM APIs and algorithms from a distributed engine - MAHOUT-1529
 Port Naive Bayes - MAHOUT-1493
Work in progress¶

Text-delimited files for input and output - MAHOUT-1568

Your issue here!