org.apache.mahout.math.map (Mahout Math 0.13.0 API)

Class Summary
Class	Description
AbstractByteByteMap
AbstractByteCharMap
AbstractByteDoubleMap
AbstractByteFloatMap
AbstractByteIntMap
AbstractByteLongMap
AbstractByteObjectMap<T>
AbstractByteShortMap
AbstractCharByteMap
AbstractCharCharMap
AbstractCharDoubleMap
AbstractCharFloatMap
AbstractCharIntMap
AbstractCharLongMap
AbstractCharObjectMap<T>
AbstractCharShortMap
AbstractDoubleByteMap
AbstractDoubleCharMap
AbstractDoubleDoubleMap
AbstractDoubleFloatMap
AbstractDoubleIntMap
AbstractDoubleLongMap
AbstractDoubleObjectMap<T>
AbstractDoubleShortMap
AbstractFloatByteMap
AbstractFloatCharMap
AbstractFloatDoubleMap
AbstractFloatFloatMap
AbstractFloatIntMap
AbstractFloatLongMap
AbstractFloatObjectMap<T>
AbstractFloatShortMap
AbstractIntByteMap
AbstractIntCharMap
AbstractIntDoubleMap
AbstractIntFloatMap
AbstractIntIntMap
AbstractIntLongMap
AbstractIntObjectMap<T>
AbstractIntShortMap
AbstractLongByteMap
AbstractLongCharMap
AbstractLongDoubleMap
AbstractLongFloatMap
AbstractLongIntMap
AbstractLongLongMap
AbstractLongObjectMap<T>
AbstractLongShortMap
AbstractObjectByteMap<T>
AbstractObjectCharMap<T>
AbstractObjectDoubleMap<T>
AbstractObjectFloatMap<T>
AbstractObjectIntMap<T>
AbstractObjectLongMap<T>
AbstractObjectShortMap<T>
AbstractShortByteMap
AbstractShortCharMap
AbstractShortDoubleMap
AbstractShortFloatMap
AbstractShortIntMap
AbstractShortLongMap
AbstractShortObjectMap<T>
AbstractShortShortMap
HashFunctions	Provides various hash functions.
OpenByteByteHashMap	Open hash map from byte keys to byte values.
OpenByteCharHashMap	Open hash map from byte keys to char values.
OpenByteDoubleHashMap	Open hash map from byte keys to double values.
OpenByteFloatHashMap	Open hash map from byte keys to float values.
OpenByteIntHashMap	Open hash map from byte keys to int values.
OpenByteLongHashMap	Open hash map from byte keys to long values.
OpenByteObjectHashMap<T>
OpenByteShortHashMap	Open hash map from byte keys to short values.
OpenCharByteHashMap	Open hash map from char keys to byte values.
OpenCharCharHashMap	Open hash map from char keys to char values.
OpenCharDoubleHashMap	Open hash map from char keys to double values.
OpenCharFloatHashMap	Open hash map from char keys to float values.
OpenCharIntHashMap	Open hash map from char keys to int values.
OpenCharLongHashMap	Open hash map from char keys to long values.
OpenCharObjectHashMap<T>
OpenCharShortHashMap	Open hash map from char keys to short values.
OpenDoubleByteHashMap	Open hash map from double keys to byte values.
OpenDoubleCharHashMap	Open hash map from double keys to char values.
OpenDoubleDoubleHashMap	Open hash map from double keys to double values.
OpenDoubleFloatHashMap	Open hash map from double keys to float values.
OpenDoubleIntHashMap	Open hash map from double keys to int values.
OpenDoubleLongHashMap	Open hash map from double keys to long values.
OpenDoubleObjectHashMap<T>
OpenDoubleShortHashMap	Open hash map from double keys to short values.
OpenFloatByteHashMap	Open hash map from float keys to byte values.
OpenFloatCharHashMap	Open hash map from float keys to char values.
OpenFloatDoubleHashMap	Open hash map from float keys to double values.
OpenFloatFloatHashMap	Open hash map from float keys to float values.
OpenFloatIntHashMap	Open hash map from float keys to int values.
OpenFloatLongHashMap	Open hash map from float keys to long values.
OpenFloatObjectHashMap<T>
OpenFloatShortHashMap	Open hash map from float keys to short values.
OpenHashMap<K,V>	Open hash map.
OpenIntByteHashMap	Open hash map from int keys to byte values.
OpenIntCharHashMap	Open hash map from int keys to char values.
OpenIntDoubleHashMap	Open hash map from int keys to double values.
OpenIntFloatHashMap	Open hash map from int keys to float values.
OpenIntIntHashMap	Open hash map from int keys to int values.
OpenIntLongHashMap	Open hash map from int keys to long values.
OpenIntObjectHashMap<T>
OpenIntShortHashMap	Open hash map from int keys to short values.
OpenLongByteHashMap	Open hash map from long keys to byte values.
OpenLongCharHashMap	Open hash map from long keys to char values.
OpenLongDoubleHashMap	Open hash map from long keys to double values.
OpenLongFloatHashMap	Open hash map from long keys to float values.
OpenLongIntHashMap	Open hash map from long keys to int values.
OpenLongLongHashMap	Open hash map from long keys to long values.
OpenLongObjectHashMap<T>
OpenLongShortHashMap	Open hash map from long keys to short values.
OpenObjectByteHashMap<T>	Open hash map from Object keys to byte values.
OpenObjectCharHashMap<T>	Open hash map from Object keys to char values.
OpenObjectDoubleHashMap<T>	Open hash map from Object keys to double values.
OpenObjectFloatHashMap<T>	Open hash map from Object keys to float values.
OpenObjectIntHashMap<T>	Open hash map from Object keys to int values.
OpenObjectLongHashMap<T>	Open hash map from Object keys to long values.
OpenObjectShortHashMap<T>	Open hash map from Object keys to short values.
OpenShortByteHashMap	Open hash map from short keys to byte values.
OpenShortCharHashMap	Open hash map from short keys to char values.
OpenShortDoubleHashMap	Open hash map from short keys to double values.
OpenShortFloatHashMap	Open hash map from short keys to float values.
OpenShortIntHashMap	Open hash map from short keys to int values.
OpenShortLongHashMap	Open hash map from short keys to long values.
OpenShortObjectHashMap<T>
OpenShortShortHashMap	Open hash map from short keys to short values.
PrimeFinder	Not of interest for users; only for implementors of hashtables.

Package org.apache.mahout.math.map Description

Automatically growing and shrinking maps holding objects or primitive data types such as int, double, etc. Currently all maps are based upon hashing.

1. Overview

The map package offers flexible object oriented abstractions modelling automatically resizing maps. It is designed to be scalable in terms of performance and memory requirements.

Features include:

Maps operating on objects as well as all primitive data types such as int, double, etc.
Compact representations
Support for quick access to associations
A number of general purpose map operations

File-based I/O can be achieved through the standard Java built-in serialization mechanism. All classes implement the Serializable interface. However, the toolkit is entirely decoupled from advanced I/O. It provides data structures and algorithms only.

This toolkit borrows some terminology from the Javasoft Collections framework written by Josh Bloch and introduced in JDK 1.2.

2. Introduction

A map is an associative container that manages a set of (key,value) pairs. It is useful for implementing a collection of one-to-one mappings. A (key,value) pair is called an association. A value can be looked up via its key. Associations can quickly be set, removed and retrieved. They are stored in a hashing structure based on the hash code of their keys, which is obtained by using a hash function.

A map can, for example, contain Name-->Location associations like {("Pete", "Geneva"), ("Steve", "Paris"), ("Robert", "New York")} used in address books or Index-->Value mappings like {(0, 100), (3, 1000), (100000, 70)} representing sparse lists or matrices. For example, this could mean that at index 0 we have a value of 100, at index 3 we have a value of 1000, at index 100000 we have a value of 70, and at all other indexes we have a value of, say, zero. Another example is a map of IP addresses to domain names (DNS). Maps can also be useful to represent multisets, that is, sets where elements can occur more than once. For multisets one would have Value-->Frequency mappings like {(100, 1), (50, 1000), (101, 3)} meaning element 100 occurs 1 time, element 50 occurs 1000 times, element 101 occurs 3 times. Further, maps can also manage ObjectIdentifier-->Object mappings like {(12, obj1), (7, obj2), (10000, obj3), (9, obj4)} used in Object Databases.
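The Value-->Frequency idea can be sketched in a few lines. This example deliberately uses java.util.HashMap as a stand-in so that it runs without any dependency; the class name MultiSetSketch and the method frequencies are hypothetical and not part of this package.

```java
import java.util.HashMap;
import java.util.Map;

final class MultiSetSketch {
    /** Counts how often each element occurs, producing Value-->Frequency associations. */
    static Map<Integer, Integer> frequencies(int[] elements) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (int element : elements) {
            // Insert 1 for a new element, otherwise add 1 to the stored count.
            counts.merge(element, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<Integer, Integer> counts = frequencies(new int[] {100, 50, 50, 101, 101, 101});
        System.out.println("100 occurs " + counts.get(100) + " time(s)");
        System.out.println("50 occurs " + counts.get(50) + " time(s)");
        System.out.println("101 occurs " + counts.get(101) + " time(s)");
        // Elements that were never added have frequency zero.
        System.out.println("7 occurs " + counts.getOrDefault(7, 0) + " time(s)");
    }
}
```

A primitive map from this package, holding (int-->int) associations, would avoid the boxing that the stand-in incurs.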

A map cannot contain two or more equal keys; a key can map to at most one value. However, more than one key can map to identical values. For primitive data types "equality" of keys is defined as identity (operator ==). For maps using Object keys, the meaning of "equality" can be specified by the user upon instance construction. It can either be defined to be identity (operator ==) or to be given by the method Object.equals(Object). Associations of kind (AnyType,Object) can be of the form (AnyKey,null), i.e. values can be null.
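The difference between the two notions of key equality can be illustrated with standard JDK classes. This is only an analogy, not this package's API: java.util.HashMap compares keys with Object.equals(Object), while java.util.IdentityHashMap compares them with operator ==. The class name KeyEqualitySketch is hypothetical.

```java
import java.util.HashMap;
import java.util.IdentityHashMap;
import java.util.Map;

final class KeyEqualitySketch {
    /** Adds two keys with equal content but different identity; returns the resulting map size. */
    static int distinctKeys(Map<String, String> map) {
        String first = new String("Pete");
        String second = new String("Pete"); // equal by equals(), different by ==
        map.put(first, "Geneva");
        map.put(second, null);              // an association of the form (AnyKey,null)
        return map.size();
    }

    public static void main(String[] args) {
        // equals-based equality: the second put replaces the first association.
        System.out.println(distinctKeys(new HashMap<>()));
        // identity-based equality: both associations are kept.
        System.out.println(distinctKeys(new IdentityHashMap<>()));
    }
}
```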

The classes of this package make no guarantees as to the order of the elements returned by iterators; in particular, they do not guarantee that the order will remain constant over time.

Copying

Any map can be copied. A copy is equal to the original but entirely independent of the original. So changes in the copy are not reflected in the original, and vice-versa.

3. Package organization

For most primitive data types and for objects there exists a separate map version. All versions are the same, except that they operate on different data types. Colt originally included two kinds of map implementations, tagged Chained and Open. Note: Chained is no longer included. Wherever it is mentioned, it is of historical interest only.

Chained uses extendible separate chaining with chains holding unsorted dynamically linked collision lists.
Open uses extendible open addressing with double hashing.
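To make "open addressing with double hashing" concrete, here is a minimal sketch of the technique for (int-->double) associations. It is not the implementation used by the classes in this package: it has a fixed capacity, never grows or shrinks, and does not support removal. The class name DoubleHashingSketch and all of its members are hypothetical.

```java
final class DoubleHashingSketch {
    private static final byte FREE = 0;
    private static final byte FULL = 1;

    private final int[] keys;
    private final double[] values;
    private final byte[] state; // one state byte per slot
    private int size;

    /** The capacity is assumed to be a prime >= 3 so that every probe sequence visits all slots. */
    DoubleHashingSketch(int primeCapacity) {
        if (primeCapacity < 3) throw new IllegalArgumentException("capacity must be a prime >= 3");
        keys = new int[primeCapacity];
        values = new double[primeCapacity];
        state = new byte[primeCapacity];
    }

    /** Returns the slot holding the key, else the first free slot on its probe sequence, else -1. */
    private int slotFor(int key) {
        int capacity = keys.length;
        int hash = key & 0x7FFFFFFF;           // force a non-negative hash
        int index = hash % capacity;           // first hash function: start slot
        int step = 1 + hash % (capacity - 2);  // second hash function: probe step, never zero
        for (int probes = 0; probes < capacity; probes++) {
            if (state[index] == FREE || keys[index] == key) return index;
            index -= step;
            if (index < 0) index += capacity;
        }
        return -1; // table is full and the key is absent
    }

    /** Associates the value with the key; returns true if the key was not present before. */
    boolean put(int key, double value) {
        int index = slotFor(key);
        if (index < 0) throw new IllegalStateException("table full; this sketch does not grow");
        boolean isNew = state[index] == FREE;
        keys[index] = key;
        values[index] = value;
        state[index] = FULL;
        if (isNew) size++;
        return isNew;
    }

    /** Returns the value for the key, or 0.0 if the key is absent. */
    double get(int key) {
        int index = slotFor(key);
        return (index >= 0 && state[index] == FULL) ? values[index] : 0.0;
    }

    int size() {
        return size;
    }

    public static void main(String[] args) {
        DoubleHashingSketch map = new DoubleHashingSketch(11);
        map.put(0, 100.0);
        map.put(11, 1000.0); // same start slot as key 0, resolved by a different step
        map.put(22, 70.0);   // same start slot again, yet another step
        System.out.println("get(11)=" + map.get(11));
        System.out.println("size=" + map.size());
    }
}
```

Keys 0, 11 and 22 all start at slot 0 in a table of capacity 11, but their probe steps differ (1, 3 and 5), which is what distinguishes double hashing from linear probing.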

Class naming follows the schema <Implementation><KeyType><ValueType>HashMap. For example, an OpenIntDoubleHashMap holds (int-->double) associations and is implemented with open addressing. An OpenIntObjectHashMap holds (int-->Object) associations and is implemented with open addressing.

The classes for maps of a given (key,value) type are derived from a common abstract base class tagged Abstract<KeyType><ValueType>Map. For example, all maps operating on (int-->double) associations are derived from AbstractIntDoubleMap, which in turn is derived from an abstract base class tying together all maps regardless of association type, AbstractSet. The abstract base classes provide skeleton implementations for all but a few methods. Experimental layouts (such as chaining, open addressing, extensible hashing, red-black-trees, etc.) can easily be implemented and inherit a rich set of functionality. Have a look at the javadoc tree view to get the broad picture.

4. Example usage

 int[]    keys   = {0    , 3     , 100000, 9   };
 double[] values = {100.0, 1000.0, 70.0  , 71.0};
 AbstractIntDoubleMap map = new OpenIntDoubleHashMap();
 // add several associations
 for (int i=0; i < keys.length; i++) map.put(keys[i], values[i]);
 log.info("map="+map);
 log.info("size="+map.size());
 log.info(map.containsKey(3));
 log.info("get(3)="+map.get(3));
 log.info(map.containsKey(4));
 log.info("get(4)="+map.get(4));
 log.info(map.containsValue(71.0));
 log.info("keyOf(71.0)="+map.keyOf(71.0));
 // remove one association
 map.removeKey(3);
 log.info("\nmap="+map);
 log.info(map.containsKey(3));
 log.info("get(3)="+map.get(3));
 log.info(map.containsValue(1000.0));
 log.info("keyOf(1000.0)="+map.keyOf(1000.0));
 // clear
 map.clear();
 log.info("\nmap="+map);
 log.info("size="+map.size());

yields the following output

 map=[0->100.0, 3->1000.0, 9->71.0, 100000->70.0]
 size=4
 true
 get(3)=1000.0
 false
 get(4)=0.0
 true
 keyOf(71.0)=9
 map=[0->100.0, 9->71.0, 100000->70.0]
 false
 get(3)=0.0
 false
 keyOf(1000.0)=-2147483648
 map=[]
 size=0

5. Notes

Note that implementations are not synchronized.

Choosing efficient parameters for hash maps is not always easy. However, since parameters determine efficiency and memory requirements, here is a quick guide on how to choose them. If your use case does not rely heavily on hash maps but uses them just because they provide convenient functionality, you can safely skip this section. For those of you who care, read on.

There are three parameters that can be customized upon map construction: initialCapacity, minLoadFactor and maxLoadFactor. The more memory one can afford, the faster a hash map. The hash map's capacity is the maximum number of associations that can be added without needing to allocate new internal memory. A larger capacity means faster adding, searching and removing. The initialCapacity corresponds to the capacity used upon instance construction.

The loadFactor of a hash map measures the degree of "fullness". It is given by the number of associations (size()) divided by the hash map capacity (0.0 <= loadFactor <= 1.0). The more associations are added, the larger the loadFactor and the more hash map performance degrades. Therefore, when the loadFactor exceeds a customizable threshold (maxLoadFactor), the hash map is automatically grown. In this way performance degradation can be avoided. Similarly, when the loadFactor falls below a customizable threshold (minLoadFactor), the hash map is automatically shrunk. In this way excessive memory consumption can be avoided. Automatic resizing (both growing and shrinking) obeys the following invariant:

capacity * minLoadFactor <= size() <= capacity * maxLoadFactor

The term capacity * minLoadFactor is called the low water mark, capacity * maxLoadFactor is called the high water mark. In other words, the number of associations may vary within the water mark constraints. When it goes out of range, the map is automatically resized and memory consumption changes proportionally.
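The water mark check can be sketched as follows. This is only an illustration of the invariant stated above, not the resizing code of the classes in this package; the class WaterMarkSketch, the enum Action and the method check are hypothetical names.

```java
final class WaterMarkSketch {
    enum Action { GROW, SHRINK, KEEP }

    /** Decides whether a map with the given size and capacity violates the water mark invariant. */
    static Action check(int size, int capacity, double minLoadFactor, double maxLoadFactor) {
        double lowWaterMark = capacity * minLoadFactor;
        double highWaterMark = capacity * maxLoadFactor;
        if (size > highWaterMark) return Action.GROW;   // too full: performance would degrade
        if (size < lowWaterMark) return Action.SHRINK;  // too empty: memory would be wasted
        return Action.KEEP;                             // within the water marks
    }

    public static void main(String[] args) {
        // capacity 1000, minLoadFactor 0.25, maxLoadFactor 0.5: water marks are 250 and 500
        System.out.println(check(500, 1000, 0.25, 0.5));
        System.out.println(check(501, 1000, 0.25, 0.5));
        System.out.println(check(249, 1000, 0.25, 0.5));
    }
}
```

With minLoadFactor set to 0 the low water mark is 0, so the SHRINK branch can never be taken.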

To tune for memory at the expense of performance, increase both minLoadFactor and maxLoadFactor.
To tune for performance at the expense of memory, decrease both minLoadFactor and maxLoadFactor. As a special case, set minLoadFactor=0 to avoid any automatic shrinking.

Resizing large hash maps can be time consuming, O(size()), and should be avoided if possible (maintaining primes is not the reason). Unnecessary growing operations can be avoided if the number of associations is known before they are added, or can be estimated.

In such a case good parameters are as follows:

For chaining:
Set the initialCapacity = 1.4*expectedSize or greater.
Set the maxLoadFactor = 0.8 or greater.

For open addressing:
Set the initialCapacity = 2*expectedSize or greater. Alternatively call ensureCapacity(...).
Set the maxLoadFactor = 0.5.
Never set maxLoadFactor > 0.55; open addressing exponentially slows down beyond that point.

In this way the hash map will never need to grow and still stay fast. It is never a good idea to set maxLoadFactor < 0.1, because the hash map would grow too often. If it is entirely unknown how many associations the application will use, the default constructor should be used. The map will grow and shrink as needed.
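The open addressing guidance can be turned into a small helper that derives the construction parameters from an expected size. This is a sketch under stated assumptions: the class PresizeSketch and its methods are hypothetical, and a real map may round the requested capacity (for example to a nearby prime), which the sketch ignores.

```java
final class PresizeSketch {
    /** maxLoadFactor recommended above for open addressing. */
    static final double MAX_LOAD_FACTOR = 0.5;

    /** initialCapacity = 2 * expectedSize, failing loudly on int overflow. */
    static int initialCapacityFor(int expectedSize) {
        return Math.multiplyExact(2, expectedSize);
    }

    /** True if expectedSize associations stay at or below the high water mark. */
    static boolean fitsWithoutGrowing(int expectedSize, int capacity, double maxLoadFactor) {
        return expectedSize <= capacity * maxLoadFactor;
    }

    public static void main(String[] args) {
        int expectedSize = 1_000_000;
        int initialCapacity = initialCapacityFor(expectedSize);
        System.out.println("initialCapacity=" + initialCapacity);
        System.out.println("fits=" + fitsWithoutGrowing(expectedSize, initialCapacity, MAX_LOAD_FACTOR));
        // These values would then be supplied as initialCapacity and maxLoadFactor at map construction.
    }
}
```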

Comparison of chaining and open addressing

Chaining is faster than open addressing, when assuming unconstrained memory consumption. Open addressing is more space efficient than chaining, because it does not create entry objects but uses primitive arrays which are considerably smaller. Entry objects consume significant amounts of memory compared to the information they actually hold. Open addressing also poses no problems to the garbage collector. In contrast, chaining can create millions of entry objects which are linked; a nightmare for any garbage collector. In addition, entry object creation is a bit slow.
Therefore, with the same amount of memory, or even less memory, hash maps with larger capacity can be maintained under open addressing, which yields smaller loadFactors, which in turn keeps performance competitive with chaining. In our benchmarks, using significantly less memory, open addressing usually is not more than 1.2-1.5 times slower than chaining.

Further readings:
Knuth D., The Art of Computer Programming: Searching and Sorting, 3rd ed.
Griswold W., Townsend G., The Design and Implementation of Dynamic Hashing for Sets and Tables in Icon, Software - Practice and Experience, Vol. 23(4), 351-367 (April 1993).
Larson P., Dynamic hash tables, Comm. of the ACM, 31, (4), 1988.

Performance:

Time complexity:
The classes offer expected time complexity O(1) (i.e. constant time) for the basic operations put, get, removeKey, containsKey and size, assuming the hash function disperses the elements properly among the buckets. Otherwise, pathological cases, although highly improbable, can occur, degrading performance to O(N) in the worst case. Operations containsValue and keyOf are O(N).

Memory requirements for open addressing:
worst case: memory [bytes] = (1/minLoadFactor) * size() * (1 + sizeOf(key) + sizeOf(value)).
best case: memory [bytes] = (1/maxLoadFactor) * size() * (1 + sizeOf(key) + sizeOf(value)).
Here sizeOf(int) = 4, sizeOf(double) = 8, sizeOf(Object) = 4, etc. Thus, an OpenIntIntHashMap with minLoadFactor=0.25 and maxLoadFactor=0.5 and 1000000 associations uses between 17 MB and 34 MB. The same map with 1000 associations uses between 17 and 34 KB.
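The two formulas can be evaluated directly. The sketch below reproduces the OpenIntIntHashMap figures, assuming sizeOf(int) = 4 as stated; the class OpenAddressingMemorySketch and the method bytes are hypothetical names.

```java
final class OpenAddressingMemorySketch {
    /** memory [bytes] = (1/loadFactor) * size * (1 + sizeOf(key) + sizeOf(value)). */
    static long bytes(double loadFactor, long size, int keyBytes, int valueBytes) {
        return Math.round((1.0 / loadFactor) * size * (1 + keyBytes + valueBytes));
    }

    public static void main(String[] args) {
        long size = 1_000_000L;
        long best = bytes(0.5, size, 4, 4);   // map is at maxLoadFactor
        long worst = bytes(0.25, size, 4, 4); // map is at minLoadFactor
        System.out.println("best case:  " + best + " bytes");
        System.out.println("worst case: " + worst + " bytes");
        // Expressed in units of 2^20 bytes for comparison with the figures above.
        System.out.println("best case:  " + best / (1024.0 * 1024.0) + " MB");
        System.out.println("worst case: " + worst / (1024.0 * 1024.0) + " MB");
    }
}
```

The byte counts come out at 18,000,000 and 36,000,000, which correspond to roughly 17 MB and 34 MB when 1 MB is taken as 2^20 bytes.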