
Package org.apache.mahout.math.list

Resizable lists holding objects or primitive data types such as int, double, etc.


Package org.apache.mahout.math.list Description

Resizable lists holding objects or primitive data types such as int, double, etc. For non-resizable lists (1-dimensional matrices) see package org.apache.mahout.math.matrix.

Getting Started

1. Overview

The list package offers flexible object-oriented abstractions that model dynamically resizing lists holding objects or primitive data types such as int, double, etc. It is designed to scale in terms of both performance and memory requirements.

Features include:

File-based I/O can be achieved through the standard Java built-in serialization mechanism. All classes implement the Serializable interface. However, the toolkit is entirely decoupled from advanced I/O. It provides data structures and algorithms only.
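As a rough illustration of the serialization mechanism mentioned above, the following sketch round-trips a small Serializable object through the standard Java object streams. TinyDoubleList is a hypothetical stand-in, not a class of this package, and the in-memory byte streams could be swapped for FileOutputStream and FileInputStream to obtain file-based I/O.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.Arrays;

class SerializationSketch {

    /** Hypothetical stand-in for a serializable list of doubles (not a Mahout class). */
    static final class TinyDoubleList implements Serializable {
        private static final long serialVersionUID = 1L;
        final double[] elements;

        TinyDoubleList(double... elements) {
            this.elements = elements.clone();
        }
    }

    /** Writes the list to a byte buffer with ObjectOutputStream and reads it back. */
    static TinyDoubleList roundTrip(TinyDoubleList list) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(list);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (TinyDoubleList) in.readObject();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        TinyDoubleList restored = roundTrip(new TinyDoubleList(1.0, 2.5, 4.0));
        System.out.println(Arrays.toString(restored.elements)); // prints [1.0, 2.5, 4.0]
    }
}
```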

This toolkit borrows concepts and terminology from the Javasoft Collections framework written by Josh Bloch and introduced in JDK 1.2.

2. Introduction

Lists are fundamental to virtually any application. Large-scale resizable lists are used, for example, in scientific computations, simulations, and database management systems, to name just a few.

A list is a container holding elements that can be accessed via zero-based indexes. Lists may be implemented in different ways (most commonly with arrays). A resizable list automatically grows as elements are added. The lists of this package do not automatically shrink. Shrinking needs to be triggered by explicitly calling trimToSize() methods.

Growing policy: A list implemented with arrays initially has a certain initialCapacity, by default 10 elements, but customizable upon instance construction. As elements are added, this capacity may no longer be sufficient. When a list is grown automatically, its capacity is expanded to 1.5*currentCapacity. Because the capacity grows geometrically, excessive resizing (which involves copying) is avoided.
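The growing policy and the explicit shrinking described above can be sketched roughly as follows. GrowthSketch is a hypothetical, simplified class written for illustration only. The integer arithmetic used for the 1.5 factor and the "at least one more slot" guard are assumptions, and the package's actual implementation may round differently.

```java
import java.util.Arrays;

class GrowthSketch {
    private double[] elements;
    private int size;

    GrowthSketch() {
        this(10); // default initial capacity, as stated in the text
    }

    GrowthSketch(int initialCapacity) {
        elements = new double[initialCapacity];
    }

    void add(double value) {
        if (size == elements.length) {
            // Assumption: grow to about 1.5 times the current capacity,
            // but always by at least one slot (covers capacities 0 and 1).
            int newCapacity = Math.max(size + 1, elements.length + elements.length / 2);
            elements = Arrays.copyOf(elements, newCapacity);
        }
        elements[size++] = value;
    }

    /** Shrinks the backing array to the current size; never called automatically. */
    void trimToSize() {
        if (elements.length > size) {
            elements = Arrays.copyOf(elements, size);
        }
    }

    int size() {
        return size;
    }

    int capacity() {
        return elements.length;
    }

    public static void main(String[] args) {
        GrowthSketch list = new GrowthSketch();
        for (int i = 0; i < 11; i++) {
            list.add(i);
        }
        System.out.println(list.capacity()); // prints 15: grown once from 10
        list.trimToSize();
        System.out.println(list.capacity()); // prints 11: trimmed to size
    }
}
```

With geometric growth, appending n elements triggers only on the order of log n reallocations, which is why the amortized cost per add stays constant.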

Copying

Any list can be copied. A copy is equal to the original but entirely independent of the original. So changes in the copy are not reflected in the original, and vice-versa.
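The copy semantics can be illustrated with a minimal sketch. CopySketch and its copy() method are hypothetical names chosen for this example. The point is only that the copy duplicates the backing array, so that the two lists start out equal and then evolve independently.

```java
import java.util.Arrays;

class CopySketch {
    private final double[] elements;

    CopySketch(double... values) {
        this.elements = values.clone(); // own array, never shared with the caller
    }

    /** Returns a copy backed by its own array. */
    CopySketch copy() {
        return new CopySketch(elements);
    }

    double get(int index) {
        return elements[index];
    }

    void set(int index, double value) {
        elements[index] = value;
    }

    @Override
    public boolean equals(Object other) {
        return other instanceof CopySketch
            && Arrays.equals(elements, ((CopySketch) other).elements);
    }

    @Override
    public int hashCode() {
        return Arrays.hashCode(elements);
    }

    public static void main(String[] args) {
        CopySketch original = new CopySketch(1.0, 2.0, 3.0);
        CopySketch copy = original.copy();
        System.out.println(copy.equals(original)); // prints true
        copy.set(0, 99.0);                         // change only the copy
        System.out.println(original.get(0));       // prints 1.0
        System.out.println(copy.equals(original)); // prints false
    }
}
```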

3. Organization of this package

Class naming follows the schema <ElementType><ImplementationTechnique>List. For example, we have a DoubleArrayList, which is a list holding double elements implemented with double[] arrays.

The classes for lists of a given value type are derived from a common abstract base class named Abstract<ElementType>List. For example, all lists operating on double elements are derived from AbstractDoubleList, which in turn is derived from AbstractList, an abstract base class tying together all lists regardless of value type. The abstract base classes provide skeleton implementations for all but a few methods. Experimental data layouts (such as compressed, sparse, linked, etc.) can therefore be implemented easily and inherit a rich set of functionality. Have a look at the javadoc tree view to get the broad picture.
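To illustrate the skeleton-implementation idea, here is a simplified sketch of such a hierarchy. BaseList, BaseDoubleList and RepeatedDoubleList are hypothetical names, not classes of this package. The experimental "compressed" layout only implements size() and get(), and inherits isEmpty(), indexOf() and contains() from its base classes.

```java
class HierarchySketch {

    /** Hypothetical root class tying together all lists regardless of value type. */
    abstract static class BaseList {
        abstract int size();

        boolean isEmpty() {
            return size() == 0;
        }
    }

    /** Hypothetical base class for lists of doubles; skeleton methods build on get(). */
    abstract static class BaseDoubleList extends BaseList {
        abstract double get(int index);

        int indexOf(double value) {
            for (int i = 0; i < size(); i++) {
                if (get(i) == value) {
                    return i;
                }
            }
            return -1;
        }

        boolean contains(double value) {
            return indexOf(value) >= 0;
        }
    }

    /** Experimental compressed layout: one value repeated count times, stored once. */
    static final class RepeatedDoubleList extends BaseDoubleList {
        private final double value;
        private final int count;

        RepeatedDoubleList(double value, int count) {
            this.value = value;
            this.count = count;
        }

        @Override
        int size() {
            return count;
        }

        @Override
        double get(int index) {
            if (index < 0 || index >= count) {
                throw new IndexOutOfBoundsException("index: " + index);
            }
            return value;
        }
    }

    public static void main(String[] args) {
        BaseDoubleList list = new RepeatedDoubleList(7.0, 1000);
        System.out.println(list.contains(7.0)); // prints true
        System.out.println(list.indexOf(3.0));  // prints -1
        System.out.println(list.isEmpty());     // prints false
    }
}
```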

4. Example usage

The following snippet fills a list, randomizes it, extracts the first half of the elements, sums them up and prints the result. It is implemented entirely with accessor methods.

    int s = 1000000;
    AbstractDoubleList list = new DoubleArrayList();
    for (int i = 0; i < s; i++) {
        list.add((double) i);
    }
    list.shuffle();
    // partFromTo takes inclusive bounds, so this selects the first half
    AbstractDoubleList part = list.partFromTo(0, list.size() / 2 - 1);
    double sum = 0.0;
    for (int i = 0; i < part.size(); i++) {
        sum += part.get(i);
    }
    log.info(String.valueOf(sum)); // assumes a logger named log is in scope

For efficiency, all classes provide back doors for getting and setting the backing array directly. This way, the high-level operations of these classes can be used where appropriate, and one can switch to []-array index notation where necessary. The key methods for this are public <ElementType>[] elements() and public void elements(<ElementType>[]). The former trustingly returns the array the list keeps internally to store its elements. With this array in hand, we can use the []-array operator to iterate over large lists without copying the array or paying the performance penalty introduced by accessor methods. Alternatively, any JAL algorithm (or other algorithm) can operate on the returned primitive array. The latter method forces a list to internally hold a user-provided array, which avoids copying the elements into the list.

As a consequence, operations on primitive arrays, Colt lists and JAL algorithms can freely be mixed at zero-copy overhead.
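The zero-copy behavior of the two back-door methods can be sketched as follows. BackDoorSketch is a hypothetical, simplified class. It only assumes what the text above states: elements() hands out the internal array without a defensive copy, and elements(array) adopts the caller's array as the backing store.

```java
class BackDoorSketch {
    private double[] elements = new double[0];
    private int size;

    /** Back door (getter): returns the internal array itself, not a copy. */
    double[] elements() {
        return elements;
    }

    /** Back door (setter): adopts the given array as backing store, without copying. */
    void elements(double[] array) {
        this.elements = array;
        this.size = array.length;
    }

    double get(int index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("index: " + index);
        }
        return elements[index];
    }

    int size() {
        return size;
    }

    public static void main(String[] args) {
        double[] values = {1.0, 2.0, 3.0};
        BackDoorSketch list = new BackDoorSketch();
        list.elements(values);                         // list now shares the array
        values[0] = 42.0;                              // write through the raw array
        System.out.println(list.get(0));               // prints 42.0
        System.out.println(list.elements() == values); // prints true
    }
}
```

Because both sides share one array, a write through either view is visible through the other. That sharing is what makes the approach fast, and it is also why it breaks encapsulation.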

Note that such special treatment certainly breaks encapsulation. This functionality is provided for performance reasons only and should only be used when absolutely necessary. Here is the above example in mixed notation:

    int s = 1000000;
    DoubleArrayList list = new DoubleArrayList(s); // list.size()==0, capacity==s
    list.setSize(s);                               // list.size()==s
    double[] values = list.elements();             // zero copy, values.length==s
    for (int i = 0; i < s; i++) {
        values[i] = (double) i;
    }
    list.shuffle();
    double sum = 0.0;
    int limit = values.length / 2;
    for (int i = 0; i < limit; i++) {
        sum += values[i];
    }
    log.info(String.valueOf(sum)); // assumes a logger named log is in scope

Or, even more compactly, using lists as algorithm objects:

    int s = 1000000;
    double[] values = new double[s];
    for (int i = 0; i < s; i++) {
        values[i] = (double) i;
    }
    new DoubleArrayList(values).shuffle(); // zero-copy, shuffle via back door
    double sum = 0.0;
    int limit = values.length / 2;
    for (int i = 0; i < limit; i++) {
        sum += values[i];
    }
    log.info(String.valueOf(sum)); // assumes a logger named log is in scope

5. Notes

The quicksorts and mergesorts are the JDK 1.2 V1.26 algorithms, modified as necessary to operate on the given data types.


Copyright © 2008–2017 The Apache Software Foundation. All rights reserved.