Parametric Learning - EM

This section describes how data (i.e. a set of cases) can be used to "learn" the conditional probability distributions of a domain when only the graphical structure is given. This is known as batch learning. Note that batch learning can still be used if initial conditional probability distributions are also given.

Batch learning requires that all data be available when the learning process starts, whereas sequential learning allows the knowledge to be updated incrementally. On the other hand, sequential learning requires that an initial assessment of the conditional probabilities be given, whereas batch learning only requires the graphical structure. The batch learning method currently implemented in Hugin is known as the EM algorithm (or EM learning); it was developed by Lauritzen [Lauritzen95], see also [Cowel&Dawid92].

A case is an assignment of values to some or all of the nodes of a domain. If values have been assigned to all nodes, the case is said to be complete; otherwise, it is said to be incomplete. The data used by the learning algorithm consists of a set of cases. The cases are numbered sequentially from 0 to N-1, where N is the total number of cases.
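
As an illustration, the following sketch enters a small set of cases through the C version of the API. The node names "A" and "B", the number of cases, and the state indexes are invented for the example, and the exact function signatures should be checked against the API reference manual.

  #include "hugin.h"

  /* Sketch: enter three cases for two discrete chance nodes named "A"
     and "B".  A node that is not given a value for a given case index
     is unobserved in that case, making the case incomplete. */
  static void enter_cases (h_domain_t d)
  {
      h_node_t a = h_domain_get_node_by_name (d, "A");
      h_node_t b = h_domain_get_node_by_name (d, "B");

      h_domain_set_number_of_cases (d, 3);   /* cases are numbered 0, 1, 2 */

      /* case 0: complete */
      h_node_set_case_state (a, 0, 1);
      h_node_set_case_state (b, 0, 0);

      /* case 1: incomplete -- only A is observed */
      h_node_set_case_state (a, 1, 0);

      /* case 2: complete */
      h_node_set_case_state (a, 2, 1);
      h_node_set_case_state (b, 2, 1);
  }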

Although the learning algorithm currently implemented in Hugin cannot learn conditional distributions for continuous nodes, such nodes can contribute to the beliefs of the discrete (chance) nodes. Hence, Hugin also provides functions for entering and retrieving values for continuous nodes.
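
Continuing the sketch above, an observed value for a continuous node could be entered in a case as follows. The node name "T" and the value are invented, and the exact function name should be checked against the API reference.

  /* Sketch: enter an observed value for a continuous node in case 0. */
  h_node_t t = h_domain_get_node_by_name (d, "T");
  h_node_set_case_value (t, 0, 36.8);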

Before learning can take place, the set of nodes for which conditional probability distributions should be learned must be specified in addition to the data set. This set is specified as the set of nodes having experience tables (described in the previous section). The input to the learning algorithm is the case data; this set of cases must be non-empty. Note that if the domain is an influence diagram model, then the domain's decision nodes must be instantiated in all cases.
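
Continuing the sketch, the nodes to be learned could be marked by giving each of them an experience table; the node names are again invented, and the call should be checked against the API reference.

  /* Sketch: request learning of the conditional distributions of nodes
     "A" and "B".  The experience table of a node is created when it is
     first requested; its contents determine whether plain or penalized
     EM is performed (see below). */
  (void) h_node_get_experience_table (h_domain_get_node_by_name (d, "A"));
  (void) h_node_get_experience_table (h_domain_get_node_by_name (d, "B"));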

The learning algorithm needs to perform inference, so the domain must be compiled before the learning algorithm is called. If successful, the procedure will update the conditional probability table and the experience table of each node of the domain that has an experience table. Moreover, the inference engine will be reset and will use the new conditional probability tables.
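
A minimal sketch of the calling sequence follows, assuming that the functions return a zero status on success; check this convention and the exact signatures against the API reference.

  static void learn (h_domain_t d)
  {
      /* The domain must be compiled before EM learning is invoked. */
      if (h_domain_compile (d) != 0)
          return;                   /* compilation failed */

      /* Run EM learning: updates the conditional probability tables and
         experience tables of all nodes that have experience tables. */
      (void) h_domain_learn_tables (d);
  }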

If the contents of the experience tables are all zeros, then the algorithm will compute the best ("maximum likelihood") conditional probability tables, assuming that any table is valid (i.e. there are no restrictions on the form of the tables). If the contents are not all zeros, then the product of each experience table and the corresponding conditional probability table is used to form counts ("prior counts") that are added to those derived from the data set. This is known as "penalized EM".
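
To illustrate how the prior counts are formed (this is plain C, not a Hugin API call, and the numbers are made up): for each parent configuration, the experience count is distributed over the states of the node according to the current conditional probability table.

  /* Illustration: prior counts for a node with 2 parent configurations
     and 3 states.  experience[j] is the experience count for parent
     configuration j; cpt[j][i] is the conditional probability of state i
     given configuration j. */
  double experience[2] = { 10.0, 4.0 };
  double cpt[2][3]     = { { 0.5, 0.3, 0.2 },
                           { 0.1, 0.6, 0.3 } };
  double prior_counts[2][3];

  for (int j = 0; j < 2; j++)
      for (int i = 0; i < 3; i++)
          prior_counts[j][i] = experience[j] * cpt[j][i];

  /* prior_counts[j][i] is added to the count of (state i, parent
     configuration j) derived from the case data before the conditional
     probability tables are re-estimated. */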

The starting point for the EM algorithm is the set of conditional probability tables specified prior to calling the algorithm, assuming that the domain has been compiled with these tables. If no table has been specified for a node, a uniform distribution is assumed. Sometimes it is desirable to enforce zeros in the conditional probability tables for configurations that should be impossible (i.e. have zero probability); this can be achieved by specifying zeros for those entries in the initial tables, since the EM algorithm preserves such zeros.

The EM algorithm performs a number of iterations. In each iteration, the logarithm of the probability of the case data given the current joint probability distribution is computed. This quantity is known as the log-likelihood, and the EM algorithm attempts to maximize it. The algorithm terminates when the relative difference between the log-likelihoods of two successive iterations becomes less than the specified tolerance, which must be a positive number. By default, the tolerance is 10^-4.
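
A sketch of adjusting the termination criterion before learning; the function name follows the API reference and should be checked against your version of the API, and the value is arbitrary.

  /* Sketch: use a stricter tolerance than the default of 10^-4.
     Roughly, EM stops when |(l_k - l_(k-1)) / l_(k-1)| falls below this
     value, where l_k is the log-likelihood after iteration k. */
  h_domain_set_log_likelihood_tolerance (d, 1e-6);
  (void) h_domain_learn_tables (d);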

