# DEMPSTER-SHAFER THEORY

Proceedings of the 3rd INFORMS Workshop on Data Mining and Health Informatics (DM-HI 2008)

J. Li, D. Aleman, R. Sikora, eds.

DEMPSTER-SHAFER THEORY BASED SIMULTANEOUS LEARNING OF MULTIPLE BAYESIAN NETWORKS

Shuai Huang

Department of Industrial Engineering

Arizona State University

Tempe, AZ

Jing Li

Department of Industrial Engineering

Arizona State University

Tempe, AZ

Jinglz@http://www.wendangku.net/doc/5004ee104431b90d6c85c725.html

Abstract

Bayesian networks (BNs) have been used extensively in medical diagnosis due to their capability of learning causal relationships from data. Most existing BN structure learning algorithms have focused on learning a BN for a single problem/task. This overlooks the situations when data are available for multiple related problems/tasks, such as patients in different age cohorts, or with different genetic signatures. In these situations, learning BNs for multiple related problems/tasks simultaneously may improve the learning efficiency and effectiveness. Therefore, this paper develops a new algorithm to achieve simultaneous learning of BN structures for multiple related problems/tasks, which integrates an existing BN structure learning algorithm, called K2, and the Dempster-Shafer theory for combining different sources of evidences. The developed algorithm is demonstrated using the ALARM dataset collected for studying patient care in ICU. Keywords: Bayesian network, causal inference, Dempster-Shafer theory, multitask learning

1. Introduction

A Bayesian network (BN) is a graphical model to represent dependent, independent, and even causal relationships among the variables in a multivariate system, which is widely applied to AI, medical, and manufacturing areas [1]. A BN includes a structure and parameters: the structure of a BN is a directed acyclic graph (DAG), i.e., a set of variables connected by directed arcs, directions of arcs imply causalities, and there is no path beginning and ending with the same variable; the parameters of a BN are conditional probability distributions measuring strength of the causalities.

Recent advances in computation ability have created a tremendous opportunity for designing different kinds of algorithms to learn BN structures from data [2-4,8]. However, most of these algorithms have focused on learning the BN structure for a single problem/task. This overlooks the situations when data are available for multiple related problems/tasks. In these situations, learning BN structures for multiple related problems/tasks simultaneously may improve the learning efficiency and effectiveness.

As an example, Fig. 1 shows the BNs for two related problems/tasks, which could correspond to manufacturing products from two different production routes, or two different shifts; or correspond to patients in two different age cohorts; or correspond to gene regulatory structures from two species. Obviously, these two BNs have only two arcs different, i.e., in BN , but not in BN ; and in BN , but not in BN . Because the two BNs share most arcs, if learning of the two BNs can be conducted simultaneously rather than separately, then computational efficiency can be improved. More importantly, simultaneously learning of the two BNs can be more effective in identifying the true BN structure for each problem/task, because the shared arcs can be learned using pooled data with increased sample size. Inspired by this example, we believe that development of a simultaneous BN learning algorithm is highly desirable, as it will permit transferring information between the related problems/tasks, leading to more efficient and effective learning of BN structures.

Learning BN structures for multiple related problems/tasks is similar to multitask learning or inductive transfer which have been studied mainly for purpose of discrimination but rarely for BN structure identification [5,6]. An exception is [7], which achieved simultaneous BN learning by specifying a joint prior distribution of all the BN structures. However, selection of this prior distribution is subjective, depending on the expert’s belief on the level of similarity between the related problems/tasks.

To overcome the limitations of the existing BN learning algorithms, we develop a new algorithm to achieve simultaneous learning of BN structures for multiple related problems/tasks. This algorithm integrates an existing BN learning algorithm, called K2, and the Dempster-Shafer theory for combining different sources of evidences (in this paper, we consider that these evidences reside in the data of the related problems/tasks).

Figure 1: BNs for two related problems/tasks

2. Review of K2 algorithm and Dempster-Shafer theory

2.1 K2 algorithm for BN structure learning

In general, there are two types of BN structure learning algorithms, i.e., constraint-based [8] and score-based algorithms [2,3]. Score-based algorithms are more popular as they provide a score to evaluate fitness of the learned BN structure to the data. A typical score-based algorithm is called K2, which uses the posterior probability of a potential BN structure given the data as the score; and the larger the score, the better the structure fits the data [2].

Specifically, let , , be a set of variables; and D and be a dataset and a potential BN structure for , respectively. Then, the posterior probability of , i.e., the score, is | . Because | , ? and is the same for all potential BN structures, , can be used as a score for simplicity. It is known that when all variables are discrete, , ∏∏ ! !

∏ ! , where is the number of possible values for ; letting be the set of variables each with an arc pointing into (called the parents of ), then is the number of possible values for ; is the number of samples having the k-th value for and the j-th value for ; ∑ .

A nice property of the K2 algorithm is that its score is decomposable, i.e., , can be decomposed into a product of the scores for p local structures, times the prior , i.e., , ∏ , , where is the local structure consisting of variables and , and , is the score for . It is easy to see that , ∏ ! !

∏ ! , (1)

Given uninformative priors, i.e., same for all possible BN structures, the decomposable property of the K2 algorithm permits translation of the search for the optimal BN structure into the search for optimal local structures. This property will be utilized in developing our algorithm, as will be shown in Section 3.

2.2 Dempster-Shafer theory for combining different sources of evidences

The Dempster-Shafer theory provides a framework for combining different sources of evidence and resolving conflicts between them [9]. The theory includes the following key components:

(a) Frame of discernment, : is a finite set of elements; an element can be a hypothesis. The

set consisting of all the subsets of is called its power set, denoted by .

(b) Mass functions, : maps to interval [0,1], i.e., : 0,1 , Φ 0, ∑ =1. expresses a degree of belief regarding , based on all relevant and

available evidence.

(c) Rules of evidence combination: Suppose that , , are mass functions formed based

on evidences obtained from T different information sources, respectively. The evidences can be combined by certain rules, such as the mixture rule: ∑ , ∑ 1, 0. carries the joint information from all the sources.

3. Development of simultaneous BN structure learning algorithm (K2-DS) by integrating K2 algorithm and Dempster-Shafer theory

Given that there are related problems/tasks and a dataset , 1, , , is available for each problem/task, we develop a BN learning algorithm, called the K2-DS algorithm, that learns a BN structure for each problem/task using all the datasets available. This is different from the conventional K2 algorithm which learns a BN structure for each problem/task only based on the

dataset specific to that problem/task. The advantage of the K2-DS algorithm is obvious, because it makes use of not only the information embedded in the dataset of a problem/task, but also the information embedded in the datasets of other related problem/task. These different sources of information are combined by the Dempster-Shafer theory, leading to a new score for the K2-DS algorithm, as will be shown in Section 3.1. The score enables comparison of all potential BN structures for each problem/task and the BN structure with the maximum score will be considered as the optimal BN structure for that problem/task. However, computation of the scores for all potential BN structures is intractable. Therefore, a heuristic method is developed in Section 3.2 to find the optimal BN structure for each problem/task.

3.1 Formulation of simultaneous BN structure learning based on Dempster-Shafer theory

For the -th problem/task, 1, , , a score can be defined for each potential BN structure based on all the available datasets , , , by applying the Dempster-Shafer theory. Because the score for a BN structure can be decomposed into the scores for the local structures in the BN (see Section 2.1), if the score for each local structure can be defined, then the score for the entire BN structure can be easily obtained. Therefore, in what follows, we will show how to apply the Dempster-Shafer theory to define a score for local structure , , , based on datasets , , , where , , denotes the -th potential local structure consisting of variables and for the -th problem/task, 1, , , 1, , , 1, , .

(a) Frame of discernment: , , , , , , , .

(b) Mass functions: the mass function for , , , 1, , , which expresses a degree of

belief regarding , , based on dataset , 1, , , can be defined as , , , , | , , , ?, where , , , is given in (1). The mass functions for other elements in the power set , are zero. It can be shown that ∑ , , 1, so the definition of the mass function is valid.

(c) Rules of evidence combination: given datasets, , , , the mixture rule can be applied

to combine the mass functions , , , , , , , i.e.,

, , ∑ , , , , ∑ , 1, , 0. (2)

, , carries the joint information from all the datasets, so it serves as a score for local structure , , in the K2-DS algorithm. There are two reasons to use the mixture rule: one is its similarity to “linear option pool”, which is a way to combine probability distributions in classic probability community; the other reason is its good interpretability, in the sense that the datasets for related problems/tasks will contribute to the learning of the BN structure for a problem/task with different weights.

An important issue in applying the Dempster-Shafer theory is to determine the weights , in the mixture rule. , reflects the degree of belief that , , is supported by the dataset of the -th problem/task. This suggests a possible way to obtain , by , , ∑ , ?, where , ∑ , , ?, , 1,…, ; , , 1 if the -th and -th problems/tasks share the same local structure consisting of variables and , and , , 0 otherwise. In practice, estimates

for , , can be obtained through examining the BN structures learned by applying the K2 algorithm to the dataset of each problem/task separately. Although the estimates may not be exactly the same as the true values of , , , the performance of the K2-DS algorithm will not deteriorate, as will be shown in Section 4.

3.2 A heuristic method in search of optimal BN structures for multiple problems/tasks Based on the score in (2), all potential local structures consisting of variables and , for each problem/task, can be compared. The optimal local structure is the one associated with the highest score. All optimal local structures compose the optimal BN structure for the problem/task. However, computation of the scores for all potential local structure is intractable. Thus, a heuristic method is developed to find the optimal local structures, and later the optimal BN structure, for each problem/task, as shown in Fig. 2. Essentially, this heuristic method starts by making the assumption that a variable has no parents, and then adds incrementally the parent whose addition most increases the score of the local structure. When the addition of no single parent can increase the score, the method stops adding parents to the variable.

4. Examples

The K2-DS algorithm is demonstrated on the ALARM dataset for studying patient care in ICU. The dataset contains 37 variables and the true BN structure is shown in Fig. 3. Based on the true BN structure, we randomly add and delete arcs, creating BN structures, , , , , , for

multiple related problems/tasks, respectively. Then, one simulation dataset is generated from each of the BN structures. Next, the K2-DS algorithm in Fig. 2 is applied to the simulation datasets and the learned BN structures are denoted by , , , , . Each , is compared with , , 1, , , and the edit distance is used to measure the difference between

, and , . The edit distance is the number of edits (arc additions, deletions, or reversals) needed to get from

, to , . This experiment is repeated for N times and the average edit distance corresponding to each problem/task is recorded. The average edit distance based on the K2-DS algorithm is compared with that obtained by applying the K2 algorithm to each problem/task separately, and the results is shown in Table 1, implying that the K2-DS algorithm significantly improves the learning effectiveness. Figure 3: The BN for ALARM dataset

References

1. D.Heckerman, A.Mamdani, and M.P.Wellman. Real-world applications of Bayesian

networks, communications of the ACM, 38(3): 24-30, 1995.

2. Cooper, G. and Hersovits, E. (1992). “A Bayesian Method for the Induction of

Probabilistic Networks from Data”. Machine Learning, 9:309-347.

3. Heckerman, D. (1999). “A Tutorial on Learning with Bayesian Networks”. Learning in

Graphical Models, pages 301-354.

4. Buntine, W. (1996). “A Guide to the Literature on Learning Probabilities Networks from

Data”. IEEE Trans. On Knowledge and Data Engineering, 8:195-210, 1996.

5. Caruana, R. (1997). “Multitask Learning”. Machine Learning, 28(1):41-45.

6. Baxter, J. (1997). “A Bayesian/Information Theoretic Model of Learning to Learn via

Multiple Task Sampling”. Machine Learning, 28(1):7-39.

7. Alexandru, Niculescu-Mizi. and Rich, Caruana. (2007). “Inductive Transfer for Bayesian

Network Structure Learning ”. JMLR Workshop and Conference Proceedings Volume 2:339-346.

8. Spirtes, P.; Glymour, C.; and Scheines, R. (1993). Causation, Prediction, and Search.

Springer-Verlag.

9. Karl, S and Scott, F. (2002). “Combination of Evidence in Dempster-Shafer Theory”.

Sandia Report.

PCWP CO HRBP

HREKG HRSAT ERRCAUTER HR HISTORY CATECHOL SAO2EXPCO2ARTCO2VENTALV VENTLUNG VENITUBE

DISCONNECT MINVOLSET VENTMACH KINKEDTUBE INTUBATION PULMEMBOLUS PAP SHUNT ANAPHYLAXIS MINOVL PVSAT FIO2PRESS INSUFFANESTH TPR LVFAILURE ERRBLOWOUTPUT

STROEVOLUME LVEDVOLUME HYPOVOLEMIA CVP BP Table 1: Comparison of K2-DS algorithm with