org.apache.mahout.math.stats

## Class LogLikelihood

• ```public final class LogLikelihood
extends Object```
Utility methods for working with log-likelihood.
• ### Nested Class Summary

Nested Classes
Modifier and Type Class and Description
`static class ` `LogLikelihood.ScoredItem<T>`
• ### Method Summary

All Methods
Modifier and Type Method and Description
`static <T> List<LogLikelihood.ScoredItem<T>>` ```compareFrequencies(com.google.common.collect.Multiset<T> a, com.google.common.collect.Multiset<T> b, int maxReturn, double threshold)```
Compares two sets of counts to see which items are interestingly over-represented in the first set.
`static double` `entropy(long... elements)`
Calculates the unnormalized Shannon entropy.
`static double` ```logLikelihoodRatio(long k11, long k12, long k21, long k22)```
Calculates the raw log-likelihood ratio for two events, call them A and B.
`static double` ```rootLogLikelihoodRatio(long k11, long k12, long k21, long k22)```
Calculates the root log-likelihood ratio for two events.
• ### Methods inherited from class java.lang.Object

`clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait`
• ### Method Detail

• #### entropy

`public static double entropy(long... elements)`
Calculates the unnormalized Shannon entropy. This is $-\sum_i x_i \log(x_i / N) = -N \sum_i (x_i / N) \log(x_i / N)$, where $N = \sum_i x_i$. If the x's sum to 1, then this is the same as the normal expression. Leaving this unnormalized makes working with counts and computing the LLR easier.
Returns:
The entropy value for the elements
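The formula above can be sketched in a few lines. This is a minimal illustration, not the library's source: the class name `EntropySketch` and the helper `xLogX` are hypothetical, the natural logarithm is an assumption (the description does not state a base), and rejecting negative counts is also an assumption. It uses the algebraically equivalent form N log N - sum x_i log x_i.

```java
final class EntropySketch {
    // x * ln(x), using the convention 0 * ln(0) = 0 (hypothetical helper).
    static double xLogX(long x) {
        return x == 0 ? 0.0 : x * Math.log(x);
    }

    // Unnormalized entropy: N ln N - sum x_i ln x_i,
    // which equals -sum x_i ln(x_i / N) with N = sum x_i.
    static double entropy(long... elements) {
        long sum = 0;
        double sumXLogX = 0.0;
        for (long element : elements) {
            if (element < 0) {
                throw new IllegalArgumentException("counts must be non-negative");
            }
            sumXLogX += xLogX(element);
            sum += element;
        }
        return xLogX(sum) - sumXLogX;
    }

    public static void main(String[] args) {
        // Two equally likely outcomes: 2 * ln(2), roughly 1.386.
        System.out.println(entropy(1, 1));
        // A single certain outcome carries no entropy: prints 0.0.
        System.out.println(entropy(5, 0));
    }
}
```

Because the result is not divided by N, it scales with the total count, which is what lets counts be plugged in directly when computing the LLR.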
• #### logLikelihoodRatio

```public static double logLikelihoodRatio(long k11,
long k12,
long k21,
long k22)```
Calculates the raw log-likelihood ratio for two events, call them A and B. Then we have:

| | Event A | Everything but A |
|---|---|---|
| Event B | A and B together (k_11) | B, but not A (k_12) |
| Everything but B | A without B (k_21) | Neither A nor B (k_22) |
Parameters:
`k11` - The number of times the two events occurred together
`k12` - The number of times the second event occurred WITHOUT the first event
`k21` - The number of times the first event occurred WITHOUT the second event
`k22` - The number of times something else occurred (i.e. was neither of these events)
Returns:
The raw log-likelihood ratio

Credit to http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html for the table and the descriptions.
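One common way to compute this statistic from the 2x2 table is to combine the unnormalized entropies of the row sums, the column sums, and the full table. The sketch below assumes that formulation; the class name `LlrSketch`, the helper `xLogX`, the use of natural logarithms, and the clamp to zero are illustrative assumptions rather than a copy of the library code.

```java
final class LlrSketch {
    // x * ln(x) with 0 * ln(0) = 0 (hypothetical helper).
    static double xLogX(long x) {
        return x == 0 ? 0.0 : x * Math.log(x);
    }

    // Unnormalized entropy of a set of counts: N ln N - sum x_i ln x_i.
    static double entropy(long... counts) {
        long sum = 0;
        double sumXLogX = 0.0;
        for (long c : counts) {
            sumXLogX += xLogX(c);
            sum += c;
        }
        return xLogX(sum) - sumXLogX;
    }

    // LLR = 2 * (H(row sums) + H(column sums) - H(all four cells)).
    static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
        double rowEntropy = entropy(k11 + k12, k21 + k22);
        double columnEntropy = entropy(k11 + k21, k12 + k22);
        double matrixEntropy = entropy(k11, k12, k21, k22);
        // Clamp tiny negative values caused by floating-point round-off.
        return Math.max(0.0, 2.0 * (rowEntropy + columnEntropy - matrixEntropy));
    }

    public static void main(String[] args) {
        // Perfectly associated events: 4 * ln(2), roughly 2.77.
        System.out.println(logLikelihoodRatio(1, 0, 0, 1));
        // Independent events: a value at or very near zero.
        System.out.println(logLikelihoodRatio(10, 10, 10, 10));
    }
}
```

Larger values indicate that the observed co-occurrence of A and B departs more strongly from what independence would predict.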

• #### rootLogLikelihoodRatio

```public static double rootLogLikelihoodRatio(long k11,
long k12,
long k21,
long k22)```
Calculates the root log-likelihood ratio for two events. See `logLikelihoodRatio(long, long, long, long)`.
Parameters:
`k11` - The number of times the two events occurred together
`k12` - The number of times the second event occurred WITHOUT the first event
`k21` - The number of times the first event occurred WITHOUT the second event
`k22` - The number of times something else occurred (i.e. was neither of these events)
Returns:
The root log-likelihood ratio

Further discussion is available at http://s.apache.org/CGL and in the response to Wataru's comment at http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html
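As a rough illustration of what a root log-likelihood ratio can look like, the sketch below takes the square root of an already computed LLR and attaches a sign. The sign rule used here (negative when A is relatively less frequent alongside B than without B) is an assumption made for illustration, as are the class name `RootLlrSketch` and the method `signedRoot`; consult the linked discussion for the reasoning behind a signed root.

```java
final class RootLlrSketch {
    // llr is assumed to be a non-negative raw log-likelihood ratio
    // computed elsewhere for the same four counts.
    static double signedRoot(double llr, long k11, long k12, long k21, long k22) {
        double root = Math.sqrt(llr);
        // Share of A within the B row versus share of A within the not-B row.
        double rateWithB = (double) k11 / (k11 + k12);
        double rateWithoutB = (double) k21 / (k21 + k22);
        // Assumed convention: a negative score marks under-representation.
        return rateWithB < rateWithoutB ? -root : root;
    }

    public static void main(String[] args) {
        // A occurs only together with B: prints 2.0.
        System.out.println(signedRoot(4.0, 1, 0, 0, 1));
        // A occurs only without B: prints -2.0.
        System.out.println(signedRoot(4.0, 0, 1, 1, 0));
    }
}
```

A signed score of this kind distinguishes over-representation from under-representation, which the raw ratio alone cannot do because it is never negative.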

• #### compareFrequencies

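```public static <T> List<LogLikelihood.ScoredItem<T>> compareFrequencies(com.google.common.collect.Multiset<T> a,
com.google.common.collect.Multiset<T> b,
int maxReturn,
double threshold)```
Compares two sets of counts to see which items are interestingly over-represented in the first set.
Parameters: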
`a` - The first counts.
`b` - The reference counts.
`maxReturn` - The maximum number of items to return. Use `maxReturn >= a.elementSet().size()` to return all scores above the threshold.
`threshold` - The minimum score for items to be returned. Use 0 to return all items more common in a than b. Use `-Double.MAX_VALUE` (not `Double.MIN_VALUE`!) to not use a threshold.
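The warning about `Double.MIN_VALUE` is worth a small demonstration: in Java, `Double.MIN_VALUE` is the smallest positive double, not the most negative one. The sketch below uses a hypothetical helper `passes` and assumes the threshold is applied as `score >= threshold`; it does not call the library.

```java
final class ThresholdSketch {
    // Hypothetical stand-in for a minimum-score check (assumed to be >=).
    static boolean passes(double score, double threshold) {
        return score >= threshold;
    }

    public static void main(String[] args) {
        double negativeScore = -3.5; // an item less common in a than in b

        // Double.MIN_VALUE is a tiny positive number, so negative scores
        // are still filtered out: prints false.
        System.out.println(passes(negativeScore, Double.MIN_VALUE));

        // -Double.MAX_VALUE is the most negative finite double, so every
        // finite score passes: prints true.
        System.out.println(passes(negativeScore, -Double.MAX_VALUE));
    }
}
```

In other words, passing `Double.MIN_VALUE` behaves almost exactly like passing 0, which is why it does not disable the threshold.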