Broadcast variable abstraction
Block-map func
DRM block-wise tuple: array of row keys and the matrix block.
Checkpointed DRM API.
Additional experimental operations over CheckpointedDRM implementation.
Distributed context (a.k.a. distributed session handle).
Abstraction of optimizer/distributed engine
Basic DRM trait.
Common DRM ops
DRM row-wise tuple
Implicit broadcast -> value conversion.
All engine operations are also exposed through the context.
Compute COV(X) matrix and mean of row-wise data set. X is presented as row-wise input matrix A.
This is a "wide" procedure: the covariance matrix is returned as a DRM.
Note: will pin the input to cache if not yet pinned.
Returns: (mean vector, covariance DRM)
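The column-wise mean/covariance math behind this operation can be sketched in plain Python. This is an illustrative stand-in for the distributed computation, not the Mahout implementation; normalization by m (population covariance) is an assumption here.

```python
# Sketch of column-wise mean and covariance of a row-wise data set X:
#   mu_j  = (1/m) * sum_k X[k][j]
#   C[i][j] = (1/m) * sum_k (X[k][i] - mu_i) * (X[k][j] - mu_j)
# Normalization by m (rather than m - 1) is an assumption of this sketch.

def col_mean_cov(rows):
    """rows: list of equal-length row vectors (the row-wise data set X)."""
    m = len(rows)
    d = len(rows[0])
    mu = [sum(r[j] for r in rows) / m for j in range(d)]
    cov = [[sum((r[i] - mu[i]) * (r[j] - mu[j]) for r in rows) / m
            for j in range(d)] for i in range(d)]
    return mu, cov

mu, cov = col_mean_cov([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
```

In the distributed ("wide") variant the covariance matrix itself stays distributed; here it is a small in-core list of lists.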
Thin column-wise mean and covariance matrix computation. Same as dcolMeanCov() but suited for thin and tall inputs where covariance matrix can be reduced and finalized in driver memory.
Note: will pin input to cache if not yet pinned.
Returns: (mean vector, in-core covariance matrix)
Compute column-wise means and standard deviations -- distributed version.
Note: input will be pinned to cache if not yet pinned.
Returns: (colMeans, colStdevs)
Compute column-wise means and variances -- distributed version.
Note: will pin input to cache if not yet pinned.
Returns: (colMeans, colVariances)
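The relationship between the two column-statistics operations can be sketched as follows: the stdev variant is the variance variant with a square root applied per column. This is a plain-Python stand-in for the distributed computation; normalization by m is an assumption of the sketch.

```python
import math

# Illustrative column means / variances of a row-wise data set, with
# column stdevs obtained as the square roots of the variances.
# Normalization by m (population statistics) is an assumption here.

def col_mean_vars(rows):
    m, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / m for j in range(d)]
    variances = [sum((r[j] - means[j]) ** 2 for r in rows) / m
                 for j in range(d)]
    return means, variances

rows = [[1.0, 10.0], [3.0, 10.0], [5.0, 10.0]]
means, variances = col_mean_vars(rows)
stdevs = [math.sqrt(v) for v in variances]
```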
We assume that whenever a computational action is invoked without an explicit checkpoint, the user does not intend caching.
Implicit conversion to in-core, with CacheHint.NONE caching of the result.
Convert an arbitrarily-keyed matrix to an int-keyed matrix. Some algebra will accept only int-keyed row matrices, so this method helps with the conversion.
key type
input to be transcoded
collect old key -> int key map to front-end?
Sequentially keyed matrix + (optionally) a map from non-int keys to Int keys. If the key type is actually Int, we just return the argument with None for the map, regardless of the computeMap parameter.
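The transcoding behavior described above can be sketched in plain Python. Function and parameter names here are hypothetical illustrations, not the actual Mahout API.

```python
# Hypothetical sketch of transcoding arbitrary row keys to sequential
# int keys, optionally returning the old-key -> int-key map.

def to_int_keyed(rows, compute_map=True):
    """rows: list of (key, vector) pairs.

    Returns (int-keyed rows, key map or None)."""
    if rows and isinstance(rows[0][0], int):
        # Keys are already ints: return the input unchanged with None
        # for the map, regardless of compute_map.
        return rows, None
    key_map = {}
    out = []
    for i, (k, v) in enumerate(rows):
        key_map[k] = i
        out.append((i, v))
    return out, (key_map if compute_map else None)

recoded, kmap = to_int_keyed([("a", [1.0]), ("b", [2.0])])
already_int, no_map = to_int_keyed([(0, [1.0]), (1, [2.0])])
```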
Broadcast support API
Load a DRM from HDFS (in Mahout DRM format).
Shortcut to parallelizing matrices with indices; ignores row labels.
This creates an empty DRM with the specified number of partitions and cardinality.
Creates an empty DRM with non-trivial height.
Parallelize in-core matrix as a distributed matrix, using row ordinal indices as data set keys.
Parallelize in-core matrix as a distributed matrix, using row labels as data set keys.
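The two parallelize flavors differ only in what becomes the row key. A minimal model of the row-wise (key, vector) representation, with hypothetical function names:

```python
# Illustrative model of turning an in-core matrix into row-wise
# (key, vector) tuples. With ordinal keying, the row index is the key;
# with labeled keying, the caller-supplied label is the key.

def drm_parallelize(matrix):
    """Use row ordinal indices as data set keys."""
    return [(i, row) for i, row in enumerate(matrix)]

def drm_parallelize_labeled(matrix, labels):
    """Use row labels as data set keys."""
    return list(zip(labels, matrix))

a = [[1.0, 2.0], [3.0, 4.0]]
by_index = drm_parallelize(a)
by_label = drm_parallelize_labeled(a, ["r0", "r1"])
```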
(Optional) Sampling operation. Consistent with Spark semantics of the same.
Returns: the samples
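Spark-style sampling without replacement is Bernoulli: each row is kept independently with probability equal to the fraction, so the sample size is random rather than exactly fraction * n. A stdlib sketch of those semantics (names are illustrative):

```python
import random

# Bernoulli row sampling, mirroring Spark-style sample(fraction)
# semantics: each row is kept independently with probability
# `fraction`, so the resulting size is random, not exact.

def drm_sample_rows(rows, fraction, seed=None):
    rng = random.Random(seed)
    return [r for r in rows if rng.random() < fraction]

rows = [(i, [float(i)]) for i in range(100)]
sample = drm_sample_rows(rows, 0.2, seed=42)
```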
Convert a DRM sample into a Tab Separated Vector (TSV) to be loaded into an R-DataFrame for plotting and sketching
- DRM
- percentage of sample elements from the DRM to be fished out for plotting
Returns: TSV String
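A minimal sketch of serializing sampled rows to a tab-separated string. The formatting choices (one line per row, key in the first column, no header) are assumptions of this sketch, not the actual Mahout output format.

```python
# Sketch: serialize (key, vector) rows to a TSV string that an
# R data.frame reader (e.g. read.table with sep="\t") can ingest.
# Layout (key first, no header) is an assumption, not Mahout's format.

def drm_sample_to_tsv(rows):
    """rows: list of (key, vector) pairs -> one TSV line per row."""
    lines = []
    for key, vec in rows:
        lines.append("\t".join([str(key)] + [str(x) for x in vec]))
    return "\n".join(lines)

tsv = drm_sample_to_tsv([(0, [1.0, 2.0]), (1, [3.0, 4.0])])
```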
Compute fold-in distances (distributed version). Here, we use pretty much the same math as with squared distances.
D_sq = s*1' + 1*t' - 2*X*Y'
where s is the vector of row sums of the Hadamard product (X, X) and, similarly, t is the vector of row sums of the Hadamard product (Y, Y).
m x d row-wise dataset. Pinned to cache if not yet pinned.
n x d row-wise dataset. Pinned to cache if not yet pinned.
m x n pairwise squared distance matrix (between rows of X and Y)
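The identity above can be checked on a tiny example. This pure-Python sketch stands in for the distributed computation: s and t are the row sums of the Hadamard squares of X and Y, and each entry expands to the squared Euclidean distance.

```python
# Verifies D_sq = s*1' + 1*t' - 2*X*Y' on a small example, where
# s and t are row sums of the Hadamard products (X, X) and (Y, Y).
# D[i][j] = s[i] + t[j] - 2 * <X_i, Y_j>
#         = ||X_i||^2 + ||Y_j||^2 - 2 * <X_i, Y_j>
#         = ||X_i - Y_j||^2

def sq_dist(X, Y):
    s = [sum(v * v for v in row) for row in X]   # row sums of X ∘ X
    t = [sum(v * v for v in row) for row in Y]   # row sums of Y ∘ Y
    return [[s[i] + t[j] - 2 * sum(a * b for a, b in zip(X[i], Y[j]))
             for j in range(len(Y))] for i in range(len(X))]

X = [[0.0, 0.0], [1.0, 2.0]]   # m x d
Y = [[3.0, 4.0]]               # n x d
D = sq_dist(X, Y)              # m x n squared distances
```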
Distributed squared-distance matrix computation.
CacheHint type