General
Algorithms
Hadoop specific questions
Apache Mahout is a suite of machine learning libraries designed to be scalable and robust.
The name Mahout was originally chosen for its association with the Apache Hadoop project. A mahout is a person who drives an elephant (hint: Hadoop’s logo is an elephant). We just wanted a name that complemented Hadoop, but we do see our project as a good driver of Hadoop in the sense that we will be using and testing it. We are not, however, implying that we are controlling Hadoop’s development.
Prior to coming to the ASF, those of us working on the project voted between Howdah (the carriage on top of an elephant) and Mahout.
See http://ml-site.grantingersoll.com for old wiki and mailing list archives (all read-only).
Mahout was started by Isabel Drost, Grant Ingersoll and Karl Wettin. It started as part of the Lucene project (see the original proposal) and went on to become a top-level project in April of 2010.

The original goal was to implement all 10 algorithms from Andrew Ng’s paper "Map-Reduce for Machine Learning on Multicore".
There is some disagreement about how to pronounce the name. Webster’s has it as muh-hout (as in “out”), but the Sanskrit/Hindi origin suggests “muh-hoot”. The second pronunciation makes a nice pun on the Hebrew word מהות, meaning “essence or truth”.
See MAHOUT-335
The Books, Tutorials and Talks page contains an overview of a wide variety of presentations with links to slides where available.
We are interested in a wide variety of machine learning algorithms, many of which are already implemented in Mahout. You can find a list here.
There are many machine learning algorithms that we would like to have in Mahout. If you have an algorithm or an improvement to an algorithm that you would like to implement, start a discussion on our mailing list.
There are a number of algorithm implementations that require no Hadoop dependencies whatsoever; consult the algorithms list. In the future, we might provide more algorithm implementations on platforms better suited to machine learning, such as Apache Spark.
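As a rough illustration of what "no Hadoop dependencies" means in practice, here is a minimal sketch (assuming only the mahout-math module on the classpath) that does vector math in a plain JVM, with no Hadoop cluster or HDFS involved:

```java
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;

public class LocalVectorMath {
  public static void main(String[] args) {
    // mahout-math types work in an ordinary local JVM.
    Vector a = new DenseVector(new double[] {1.0, 2.0, 3.0});
    Vector b = new DenseVector(new double[] {4.0, 5.0, 6.0});

    // Basic linear algebra computed locally: dot product and cosine similarity.
    double dot = a.dot(b);
    double cosine = dot / (a.norm(2) * b.norm(2));
    System.out.println("dot = " + dot + ", cosine similarity = " + cosine);
  }
}
```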
If you are running training on a Hadoop cluster, keep in mind that the number of mappers started is governed by the size of the input data and the configured split/block size of your cluster. As a rule of thumb, anything below 100MB in size won’t be split by default.
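If you want more mappers for a small input, one option (assuming you assemble the job yourself with the Hadoop 2 MapReduce API; the 32MB value below is only an example) is to cap the maximum split size:

```java
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeExample {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance();
    // Cap the maximum split size at 32MB so an input smaller than the HDFS
    // block size is still divided into several splits, and therefore
    // processed by several mappers.
    FileInputFormat.setMaxInputSplitSize(job, 32L * 1024 * 1024);
  }
}
```

The same limit can usually be passed on the command line via -Dmapreduce.input.fileinputformat.split.maxsize without changing any code.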