What is Apache Mahout?

The Apache Mahout™ project's goal is to build a scalable machine learning library.

The latest release, version 0.9, includes:

  • User and Item based recommenders
  • Matrix factorization based recommenders
  • K-Means, Fuzzy K-Means clustering
  • Latent Dirichlet Allocation
  • Singular Value Decomposition
  • Logistic regression classifier
  • (Complementary) Naive Bayes classifier
  • Random forest classifier
  • High-performance Java collections
  • A vibrant community

By scalable we mean:

Scalable to large data sets. Our core algorithms for clustering, classification and collaborative filtering are implemented on top of Apache Hadoop using the map/reduce paradigm. However, we do not restrict contributions to Hadoop-based implementations: contributions that run on a single node or on a non-Hadoop cluster are welcome as well. The core libraries are highly optimized so that non-distributed algorithms also perform well.

Scalable to support your business case. Mahout is distributed under the commercially friendly Apache Software License.

Scalable community. The goal of Mahout is to build a vibrant, responsive, diverse community to facilitate discussions not only on the project itself but also on potential use cases. Come to the mailing lists to find out more.
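
To make the map/reduce paradigm mentioned above more concrete, here is a minimal sketch of a single k-means iteration split into a "map" step (assign each point to its nearest centroid) and a "reduce" step (average the points assigned to each centroid). It is plain Java written for illustration only: the class and method names are hypothetical, and it uses neither Hadoop nor Mahout's own clustering classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: plain Java, no Hadoop or Mahout classes involved.
class KMeansMapReduceSketch {

    // "Map" step: emit the index of the centroid closest to the point
    // (squared Euclidean distance).
    static int nearestCentroid(double[] point, double[][] centroids) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < centroids.length; c++) {
            double dist = 0.0;
            for (int d = 0; d < point.length; d++) {
                double diff = point[d] - centroids[c][d];
                dist += diff * diff;
            }
            if (dist < bestDist) {
                bestDist = dist;
                best = c;
            }
        }
        return best;
    }

    // "Reduce" step: average all points that were assigned to one centroid.
    static double[] mean(List<double[]> points) {
        double[] sum = new double[points.get(0).length];
        for (double[] p : points) {
            for (int d = 0; d < sum.length; d++) {
                sum[d] += p[d];
            }
        }
        for (int d = 0; d < sum.length; d++) {
            sum[d] /= points.size();
        }
        return sum;
    }

    // One full iteration: map every point, group by key, reduce each group.
    static double[][] iterate(double[][] points, double[][] centroids) {
        Map<Integer, List<double[]>> groups = new HashMap<>();
        for (double[] p : points) {
            int key = nearestCentroid(p, centroids);
            groups.computeIfAbsent(key, k -> new ArrayList<>()).add(p);
        }
        double[][] updated = new double[centroids.length][];
        for (int c = 0; c < centroids.length; c++) {
            List<double[]> assigned = groups.get(c);
            // A centroid with no assigned points keeps its old position.
            updated[c] = (assigned == null) ? centroids[c].clone() : mean(assigned);
        }
        return updated;
    }

    public static void main(String[] args) {
        double[][] points = {{1.0, 1.0}, {1.0, 3.0}, {9.0, 9.0}, {11.0, 9.0}};
        double[][] centroids = {{0.0, 0.0}, {10.0, 10.0}};
        // prints [[1.0, 2.0], [10.0, 9.0]]
        System.out.println(Arrays.deepToString(iterate(points, centroids)));
    }
}
```

In a distributed setting the map calls are independent of each other and each reduce call only needs the points for one key, which is what allows the work to be spread across a cluster.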

Currently Mahout mainly supports three use cases. Recommendation mining takes users' behavior and from that tries to find items users might like. Clustering takes, for example, text documents and groups them into sets of topically related documents. Classification learns from existing categorized documents what documents of a specific category look like and is able to assign unlabelled documents to the (hopefully) correct category.
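
As a rough illustration of the recommendation mining use case, the sketch below scores the items a user has not yet rated by the ratings of other users, weighted by how similar those users are. It is a simplified, single-node example in plain Java: the class name, the array-based rating matrix (0 meaning "not rated") and the choice of cosine similarity are assumptions made for this example, not Mahout's recommender API.

```java
// Illustrative only: a tiny user-based recommender in plain Java.
// It does not use Mahout's recommender classes.
class UserBasedRecommenderSketch {

    // Cosine similarity between two rating vectors (0 = not rated).
    static double cosine(double[] a, double[] b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / Math.sqrt(normA * normB);
    }

    // Returns the index of the highest-scoring item the user has not rated,
    // or -1 if the user has already rated every item.
    static int recommend(double[][] ratings, int user) {
        int items = ratings[user].length;
        double[] scores = new double[items];
        for (int other = 0; other < ratings.length; other++) {
            if (other == user) {
                continue;
            }
            // Similar users contribute more to the score of each item.
            double similarity = cosine(ratings[user], ratings[other]);
            for (int i = 0; i < items; i++) {
                scores[i] += similarity * ratings[other][i];
            }
        }
        int best = -1;
        for (int i = 0; i < items; i++) {
            if (ratings[user][i] != 0.0) {
                continue; // skip items the user already rated
            }
            if (best == -1 || scores[i] > scores[best]) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[][] ratings = {
            {5, 4, 0, 0}, // user 0 has not rated items 2 and 3
            {5, 5, 4, 1}, // user 1 has tastes close to user 0
            {1, 0, 1, 5}  // user 2 has different tastes
        };
        System.out.println("Recommended item for user 0: " + recommend(ratings, 0));
    }
}
```

Here user 1 is far more similar to user 0 than user 2 is, so item 2 (which user 1 rated highly) is recommended over item 3.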

Interested in helping? Join the mailing lists.

Mahout News

1 February 2014 - Apache Mahout 0.9 released

Apache Mahout has reached version 0.9. All developers are encouraged to begin using version 0.9. Highlights include:

  • A new and improved Mahout website based on Apache CMS - MAHOUT-1245
  • Multi Layer Perceptron (MLP) classifier - MAHOUT-1265. This is an early implementation of MLP intended to solicit user feedback. It still needs to be integrated into Mahout's processing pipeline to work with Mahout's vectors.
  • Scala DSL Bindings for Mahout Math Linear Algebra. See this blogpost - MAHOUT-1297
  • Recommenders as a Search. See https://github.com/pferrel/solr-recommender - MAHOUT-1288
  • Support for easy functional Matrix views and derivatives - MAHOUT-1300
  • JSON output format for ClusterDumper - MAHOUT-1343
  • Enable randomised testing for all Mahout modules using Carrot RandomizedRunner - MAHOUT-1345
  • Online Algorithm for computing accurate Quantiles using 1-dimensional Clustering - MAHOUT-1361. See this pdf for the details.
  • Upgrade to Lucene 4.6.1 - MAHOUT-1364

Changes in 0.9 are detailed in the release notes.

The following algorithms that were marked deprecated in 0.8 have been removed in 0.9:

  • From Clustering:
    Switched LDA implementation from using Gibbs Sampling to Collapsed Variational Bayes (CVB)
    MeanShift
    MinHash - removed due to poor performance, lack of support and lack of usage
  • From Classification (both are sequential implementations)
    Winnow - lack of actual usage and support
    Perceptron - lack of actual usage and support
  • Collaborative Filtering
    SlopeOne implementations in org.apache.mahout.cf.taste.hadoop.slopeone and org.apache.mahout.cf.taste.impl.recommender.slopeone
    Distributed pseudo recommender in org.apache.mahout.cf.taste.hadoop.pseudo
    TreeClusteringRecommender in org.apache.mahout.cf.taste.impl.recommender
  • Mahout Math
    Hadoop entropy stuff in org.apache.mahout.math.stats.entropy

25 July 2013 - Apache Mahout 0.8 released

Apache Mahout has reached version 0.8. All developers are encouraged to begin using version 0.8. Highlights include:

  • Numerous performance improvements to Vector and Matrix implementations, APIs and their iterators (see also MAHOUT-1192, MAHOUT-1202)
  • Numerous performance improvements to the recommender implementations (see also MAHOUT-1272, MAHOUT-1035, MAHOUT-1042, MAHOUT-1151, MAHOUT-1166, MAHOUT-1167, MAHOUT-1169, MAHOUT-1205, MAHOUT-1264)
  • MAHOUT-1088: Support for biased item-based recommender
  • MAHOUT-1089: SGD matrix factorization for rating prediction with user and item biases
  • MAHOUT-1106: Support for SVD++
  • MAHOUT-944: Support for converting one or more Lucene storage indexes to SequenceFiles as well as an upgrade of the supported Lucene version to Lucene 4.3.1.
  • MAHOUT-1154 and friends: New streaming k-means implementation that offers on-line (and fast) clustering
  • MAHOUT-833: Conversion to SequenceFiles via MapReduce; 'seqdirectory' can now be run as a MapReduce job.
  • MAHOUT-1052: Add an option to MinHashDriver that specifies the dimension of vector to hash (indexes or values).
  • MAHOUT-884: Matrix Concat utility, presently only concatenates two matrices.
  • MAHOUT-1187: Upgraded to CommonsLang3
  • MAHOUT-916: Speedup the Mahout build by making tests run in parallel.
  • The usual bug fixes. See JIRA for more information on the 0.8 release.

Changes in 0.8 are detailed in the release notes.

Downloads of all releases are available from Apache mirrors.

FUTURE PLANS

0.9

As the project moves towards a 1.0 release, the community is working to clean up or remove parts of the code base that are under-supported or that underperform, and to focus energy and contributions on key algorithms that are proven to scale in production and have seen widespread adoption. To this end, the project plans to remove support for the following algorithms in the next release unless they receive sustained support and improvement before then.

The algorithms to be removed are:

  • From Clustering:
    Dirichlet
    MeanShift
    MinHash
    Eigencuts
  • From Classification (both are sequential implementations)
    Winnow
    Perceptron
  • Frequent Pattern Mining
  • Collaborative Filtering
    All recommenders in org.apache.mahout.cf.taste.impl.recommender.knn
    SlopeOne implementations in org.apache.mahout.cf.taste.hadoop.slopeone and org.apache.mahout.cf.taste.impl.recommender.slopeone
    Distributed pseudo recommender in org.apache.mahout.cf.taste.hadoop.pseudo
    TreeClusteringRecommender in org.apache.mahout.cf.taste.impl.recommender
  • Mahout Math
    Lanczos in favour of SSVD
    Hadoop entropy stuff in org.apache.mahout.math.stats.entropy

If you are interested in supporting one or more of these algorithms, please make it known on dev@mahout.apache.org and via JIRA issues that fix or improve them. Please also provide supporting evidence of their effectiveness for you in production.

1.0 PLANS

As a community, we plan to focus 0.9 on fixing bugs and removing the code mentioned above, and to follow with a 1.0 release soon thereafter. At that point the community commits to supporting the algorithms packaged in 1.0 for at least two minor versions after their release. In the case of removal, we will deprecate the functionality in the 1.(x+1) minor release and remove it in the 1.(x+2) release. For instance, if feature X is to be removed after the 1.2 release, it will be deprecated in 1.3 and removed in 1.4.

16 June 2012 - Apache Mahout 0.7 released

Apache Mahout has reached version 0.7. All developers are encouraged to begin using version 0.7. Highlights include:

  • Outlier removal capability in K-Means, Fuzzy K, Canopy and Dirichlet Clustering
  • New Clustering implementation for K-Means, Fuzzy K, Canopy and Dirichlet using Cluster Classifiers
  • Collections and Math API consolidated
  • (Complementary) Naive Bayes refactored and cleaned
  • Watchmaker and Old Naive Bayes dropped
  • Many bug fixes, refactorings, and other small improvements

Changes in 0.7 are detailed in the release notes.

Downloads of all releases are available from Apache mirrors.

6 Feb 2012 - Apache Mahout 0.6 released

Apache Mahout has reached version 0.6. All developers are encouraged to begin using version 0.6. Highlights include:

  • Improved Decision Tree performance and added support for regression problems
  • New LDA implementation using Collapsed Variational Bayes 0th Derivative Approximation
  • Reduced runtime of LanczosSolver tests
  • K-Trusses, Top-Down and Bottom-Up clustering, Random Walk with Restarts implementation
  • Reduced runtime of dot product between vectors
  • Added MongoDB and Cassandra DataModel support
  • Increased efficiency of parallel ALS matrix factorization
  • SSVD enhancements
  • Performance improvements in RowSimilarityJob, TransposeJob
  • Added numerous clustering display examples
  • Many bug fixes, refactorings, and other small improvements

Changes in 0.6 are detailed in the release notes.

Downloads of all releases are available from Apache mirrors.

9 Oct 2011 - Mahout in Action released

At last, the book Mahout in Action is available in print. Sean Owen, Robin Anil, Ted Dunning and Ellen Friedman thank the community (especially those who were reviewers) for input during the process and hope it is enjoyable.

Find it at your favorite bookstore, or order print and eBook copies from Manning -- use discount code "mahout37" for 37% off.