Synonyms

Idiot’s Bayes; Simple Bayes

Definition

Naïve Bayes is a simple learning algorithm that utilizes Bayes rule together with a strong assumption that the attributes are conditionally independent, given the class. While this independence assumption is often violated in practice, naïve Bayes nonetheless often delivers competitive classification accuracy. Coupled with its computational efficiency and many other desirable features, this leads to naïve Bayes being widely applied in practice.

Motivation and Background

Naïve Bayes provides a mechanism for using the information in sample data to estimate the posterior probability P(y | x) of each class y, given an object x. Once we have such estimates, we can use them for classification or other decision support applications.

Naïve Bayes’ many desirable properties include:

  • Computational efficiency: Training time is linear with respect to both the number of training examples and the number of attributes, and classification time is linear with respect to the number of attributes and unaffected by the number of training examples.

  • Low variance: Because naïve Bayes does not utilize search, it has low variance, albeit at the cost of high bias.

  • Incremental learning: Naïve Bayes operates from estimates of low order probabilities that are derived from the training data. These can readily be updated as new training data are acquired.

  • Direct prediction of posterior probabilities.

  • Robustness in the face of noise: Naïve Bayes always uses all attributes for all predictions and hence is relatively insensitive to noise in the examples to be classified. Because it uses probabilities, it is also relatively insensitive to noise in the training data.

  • Robustness in the face of missing values: Because naïve Bayes always uses all attributes for all predictions, if one attribute value is missing, information from other attributes is still used, resulting in graceful degradation in performance. It is also relatively insensitive to missing attribute values in the training data due to its probabilistic framework.

Structure of Learning System

Naïve Bayes is based on Bayes rule

$$\mathrm{P}(y \mid \mathbf{x}) = \frac{\mathrm{P}(y)\,\mathrm{P}(\mathbf{x} \mid y)}{\mathrm{P}(\mathbf{x})}$$
(1)

together with an assumption that the attributes are conditionally independent given the class. For attribute-value data, this assumption entails

$$\mathrm{P}(\mathbf{x} \mid y) = \prod_{i=1}^{n} \mathrm{P}(x_{i} \mid y)$$
(2)

where x_i is the value of the ith attribute in x, and n is the number of attributes.

The denominator, P(x), can be calculated using

$$\mathrm{P}(\mathbf{x}) = \sum_{i=1}^{k} \mathrm{P}(c_{i})\,\mathrm{P}(\mathbf{x} \mid c_{i})$$
(3)

where k is the number of classes and c_i is the ith class. Thus, (1) can be computed for each class by normalizing the numerators of its right-hand side so that they sum to one.
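Written out for attribute-value data, combining (1)–(3) gives the posterior actually used for classification, with the denominator obtained by summing the per-class numerators:

$$\mathrm{P}(y \mid \mathbf{x}) = \frac{\mathrm{P}(y)\prod_{i=1}^{n}\mathrm{P}(x_{i} \mid y)}{\sum_{j=1}^{k}\mathrm{P}(c_{j})\prod_{i=1}^{n}\mathrm{P}(x_{i} \mid c_{j})}$$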

For categorical attributes, the required probabilities P(y) and P(x i | y) are normally derived from frequency counts stored in arrays whose values are calculated by a single pass through the training data at training time. These arrays can be updated as new data are acquired, supporting incremental learning. Probability estimates are usually derived from the frequency counts using smoothing functions such as the Laplace estimate or an m-estimate.
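As a concrete illustration, the following minimal sketch stores the frequency counts in dictionaries, applies Laplace smoothing when converting counts to probabilities, and supports incremental updating from new examples. The class and method names (CategoricalNB, fit_increment, predict) are illustrative only, not a standard API.

```python
# Minimal sketch: count-based naive Bayes for categorical attributes
# with Laplace smoothing and incremental updates (illustrative names).
from collections import defaultdict

class CategoricalNB:
    def __init__(self, alpha=1.0):
        self.alpha = alpha                    # Laplace smoothing constant
        self.class_counts = defaultdict(int)  # N(y)
        self.value_counts = defaultdict(int)  # N(x_i = v, y), keyed by (attribute, value, class)
        self.attr_values = defaultdict(set)   # distinct values seen per attribute
        self.n = 0                            # total number of training examples

    def fit_increment(self, x, y):
        """Update the counts from one example; supports incremental learning."""
        self.n += 1
        self.class_counts[y] += 1
        for i, v in enumerate(x):
            self.value_counts[(i, v, y)] += 1
            self.attr_values[i].add(v)

    def predict(self, x):
        """Return the class with the highest (unnormalized) posterior."""
        best, best_score = None, float("-inf")
        for y, n_y in self.class_counts.items():
            # Laplace-smoothed prior P(y)
            score = (n_y + self.alpha) / (self.n + self.alpha * len(self.class_counts))
            for i, v in enumerate(x):
                # Laplace-smoothed conditional P(x_i | y)
                num = self.value_counts[(i, v, y)] + self.alpha
                den = n_y + self.alpha * len(self.attr_values[i])
                score *= num / den
            if score > best_score:
                best, best_score = y, score
        return best

# Usage: train on attribute-value examples, then classify a new object.
nb = CategoricalNB()
for x, y in [(("sunny", "hot"), "no"), (("rain", "mild"), "yes"), (("sunny", "mild"), "yes")]:
    nb.fit_increment(x, y)
print(nb.predict(("sunny", "mild")))
```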

For numeric attributes, either the data are discretized (see discretization), or probability density estimation is employed.
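The density-estimation route is commonly implemented by fitting a normal distribution to each numeric attribute within each class; the sketch below assumes Gaussian densities, and its helper names are hypothetical.

```python
# Minimal sketch: Gaussian density estimation of P(x_i | y) for numeric attributes,
# assuming one normal distribution per attribute and class (hypothetical helpers).
import math

def gaussian_pdf(v, mean, var):
    """Normal density used as an estimate of P(x_i | y) for a numeric attribute."""
    return math.exp(-(v - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def fit_gaussians(examples):
    """Estimate per-class mean and variance for each numeric attribute."""
    stats, by_class = {}, {}
    for x, y in examples:
        by_class.setdefault(y, []).append(x)
    for y, rows in by_class.items():
        n = len(rows)
        for i in range(len(rows[0])):
            vals = [r[i] for r in rows]
            mean = sum(vals) / n
            var = sum((v - mean) ** 2 for v in vals) / max(n - 1, 1)
            stats[(i, y)] = (mean, var or 1e-9)  # guard against zero variance
    return stats

# Usage: P(x_i | y) for a new value is read off the fitted density.
stats = fit_gaussians([((5.1, 3.5), "a"), ((4.9, 3.0), "a"),
                       ((6.7, 3.1), "b"), ((6.3, 2.9), "b")])
mean, var = stats[(0, "a")]
print(gaussian_pdf(5.0, mean, var))
```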

In document classification, two variants of naïve Bayes are often employed (McCallum, 1998). The multivariate Bernoulli model utilizes naïve Bayes as described above, with each word in a corpus represented by a binary variable that is true if and only if the word is present in a document. However, only the words that are present in a document are considered when calculating the probabilities for that document.

In contrast, the multinomial model uses information about the number of times a word appears in a document. It treats each occurrence of a word in a document as a separate event. These events are assumed independent of each other. Hence the probability of a document given a class is the product of the probabilities of each word event given the class.
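The sketch below contrasts the two likelihood computations for a single document, using a small hypothetical vocabulary and made-up smoothed word probabilities P(w | y): in the Bernoulli variant each word is a binary presence variable and so contributes once, while in the multinomial variant every occurrence contributes.

```python
# Minimal sketch contrasting the two document models; the vocabulary and the
# per-class word probabilities are illustrative placeholders, not learned values.
import math

# Hypothetical smoothed per-class word probabilities P(w | y).
p_word = {
    "sport":    {"ball": 0.4, "goal": 0.3, "vote": 0.1, "election": 0.2},
    "politics": {"ball": 0.1, "goal": 0.1, "vote": 0.4, "election": 0.4},
}

def bernoulli_log_likelihood(doc_words, y):
    """Multivariate Bernoulli: each word is a binary presence variable,
    so a present word contributes once regardless of how often it occurs."""
    return sum(math.log(p_word[y][w]) for w in set(doc_words) if w in p_word[y])

def multinomial_log_likelihood(doc_words, y):
    """Multinomial: each word occurrence is a separate event,
    so repeated words contribute once per occurrence."""
    return sum(math.log(p_word[y][w]) for w in doc_words if w in p_word[y])

# Usage: compare the per-class log-likelihoods of one document under each model.
doc = ["goal", "goal", "ball"]
for y in p_word:
    print(y, bernoulli_log_likelihood(doc, y), multinomial_log_likelihood(doc, y))
```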

Cross References

Bayes Rule

Bayesian Methods

Bayesian Networks

Semi-Naïve Bayesian Learning