Constructs a decision tree by selecting the attribute with the highest information gain (entropy reduction) to split the data recursively until purity is achieved or stopping criteria are met.
Entropy of a set (S):
Information gain per attribute (A):
The attribute
Calculates the prior probability of each class and assumes independence between attributes to estimate the conditional probability, predicting the class that maximises the product (using logarithms for greater numerical stability).
Posterior probability (Bayes’ theorem with independence):
Predicted class (using logarithms for numerical stability):
where
Means and variances are calculated for each feature and class.
A small value (+1e-9) is added to the variances to avoid division by zero or extremely small values that could cause numerical instability.
The Gaussian density function (logPDF) never yields a probability of zero, because the Gaussian bell curve is always positive (albeit very small, never zero). This avoids the problem of zero probabilities that occurs in discrete Naive Bayes when a category does not appear in the training data.
Let bestLogProb be the best log-probability found, initialised to
For each class
-
The unnormalised posterior log-probability is calculated:
$$\mathrm{logProb}(c_k) = \log P(c_k) + \sum_{i=1}^{n} \log P(x_i \mid c_k)$$ where
$\log P(x_i \mid c_k)$ is obtained from the logarithmic density function (logPDF) of a normal distribution with mean$\mu_{k,i}$ and variance$\sigma_{k,i}^2$ :$$\log P(x_i \mid c_k) = -\frac{1}{2}\log(2\pi\sigma_{k,i}^2) - \frac{(x_i - \mu_{k,i})^2}{2\sigma_{k,i}^2}$$ -
If
$\text{logProb}(c_k) > \text{bestLogProb}$ , then it is updated:$$\mathrm{bestLogProb} = \mathrm{logProb}(c_k), \quad \mathrm{bestClass} = c_k$$
Finally, the class with the highest log-probability is returned:
The probability density function (“PDF”) Multivariate normal log probability distribution function.