# Scala & Spark Bindings:

*Bringing algebraic semantics*

## What is Scala & Spark Bindings?

In short, Scala & Spark Bindings for Mahout is a Scala DSL and algebraic optimizer for expressions such as the following (an actual formula from **(d)spca**):

`\[\mathbf{G}=\mathbf{B}\mathbf{B}^{\top}-\mathbf{C}-\mathbf{C}^{\top}+\mathbf{s}_{q}\mathbf{s}_{q}^{\top}\boldsymbol{\xi}^{\top}\boldsymbol{\xi}\]`

bound to in-core and distributed computations (currently, on Apache Spark).

The same formula expressed with the Mahout Scala & Spark Bindings:

```
val g = bt.t %*% bt - c - c.t + (s_q cross s_q) * (xi dot xi)
```
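
Read term by term, the code lines up with the formula roughly as follows. This is a sketch under assumptions inferred from the formula rather than stated in the text: `bt` is taken to hold the transposed matrix B, `c` the matrix C, and `s_q` and `xi` the corresponding vectors.

```latex
\begin{aligned}
\texttt{bt.t \%*\% bt} &\;\leftrightarrow\; (\mathbf{B}^{\top})^{\top}\mathbf{B}^{\top}=\mathbf{B}\mathbf{B}^{\top} \\
\texttt{c},\ \texttt{c.t} &\;\leftrightarrow\; \mathbf{C},\ \mathbf{C}^{\top} \\
\texttt{s\_q cross s\_q} &\;\leftrightarrow\; \mathbf{s}_{q}\mathbf{s}_{q}^{\top} \quad\text{(outer product, a matrix)} \\
\texttt{xi dot xi} &\;\leftrightarrow\; \boldsymbol{\xi}^{\top}\boldsymbol{\xi} \quad\text{(inner product, a scalar)}
\end{aligned}
```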

The main idea is that a scientist writing algebraic expressions should not have to care about distributed
operation plans and can work **entirely on the logical level**, just as he or she would in R.

Another idea is decoupling the logical expression from the distributed back-end. As more back-ends are added,
this implies **“write once, run everywhere”**.

The linear algebra side works with scalars, in-core vectors and matrices, and Mahout Distributed
Row Matrices (DRMs).

The ecosystem of operators is built in R’s image, i.e. it follows R naming such as %*%,
colSums, nrow, length, operating over vectors or matrices.

An important part of Spark Bindings is the expression optimizer. It looks at an expression as a whole
and figures out how it can be simplified and which physical operators should be picked. For example,
there are currently about 5 different physical operators performing DRM-DRM multiplication,
picked based on matrix geometry, distributed dataset partitioning, orientation, etc.
If we count DRM-by-in-core combinations, that adds another 4, i.e. 9 in total, all of it behind the
simple x %*% y logical notation.
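
To make the split between logical notation and physical operators more concrete, here is a minimal, self-contained sketch in plain Scala. It is a toy model, not Mahout code: the types `Drm`, `InCore`, `T`, `Times`, the object `ToyOptimizer`, and the operator labels are all hypothetical names invented for illustration, and the real optimizer considers far more (partitioning, geometry, orientation) than this one does.

```scala
// Toy logical plan: every name here is hypothetical, none is a Mahout class.
sealed trait Expr
final case class Drm(name: String, nrow: Long, ncol: Int) extends Expr   // distributed matrix leaf
final case class InCore(name: String, nrow: Int, ncol: Int) extends Expr // in-memory matrix leaf
final case class T(a: Expr) extends Expr                                 // logical transpose
final case class Times(a: Expr, b: Expr) extends Expr                    // logical a %*% b

object ToyOptimizer {

  // Simplification pass: cancel double transposes anywhere in the tree.
  def simplify(e: Expr): Expr = e match {
    case T(T(a))     => simplify(a)
    case T(a)        => T(simplify(a))
    case Times(a, b) => Times(simplify(a), simplify(b))
    case leaf        => leaf
  }

  // Pick an (illustrative) physical operator for the root of the simplified tree.
  def physical(e: Expr): String = simplify(e) match {
    case Times(T(a), b) if a == b  => "self-product"       // A' %*% A as one specialised operator
    case Times(T(_: Drm), _: Drm)  => "transposed-product" // A' %*% B without materialising A'
    case Times(_: Drm, _: InCore)  => "right-in-core"      // small right operand kept in memory
    case Times(_: InCore, _: Drm)  => "left-in-core"       // small left operand kept in memory
    case Times(_, _)               => "generic-product"
    case T(_)                      => "transpose"
    case _                         => "no-op"
  }
}
```

For instance, with `val a = Drm("A", 1000000L, 50)`, the tree `Times(T(T(T(a))), a)` first simplifies to `Times(T(a), a)` and is then mapped to the `"self-product"` label. The user-facing expression stays a plain product while the choice of operator depends on the shape of the whole tree.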

Please refer to the documentation for details.

## Status

This environment mostly addresses R-like linear algebra optimizations for
Spark, Flink and H2O.

## Documentation

- Scala and Spark bindings manual: web, pdf
- Overview blog on 0.10.x releases: blog

## Distributed methods and solvers using Bindings

- In-core (ssvd) and Distributed (dssvd) Stochastic SVD – guinea pigs – see the bindings manual
- In-core (spca) and Distributed (dspca) Stochastic PCA – guinea pigs – see the bindings manual
- Distributed thin QR decomposition (dqrThin) – guinea pig – see the bindings manual
- Current list of algorithms

## Related history of note

## Work in progress