# lecture_note_7


### Bayesian Learning:


### Two Roles for Bayesian Methods:

- Provides new practical learning algorithms:
  - Naive Bayes learning
  - Bayesian belief network learning
  - These combine prior knowledge (prior probabilities) with observed data.
- Provides a useful perspective for understanding many learning algorithms that do not explicitly manipulate probabilities.

### Bayes Theorem:

Bayes theorem:

P(h|D) = P(D|h) P(h) / P(D)

where:

- P(h) = prior probability of hypothesis h
- P(D) = prior probability of training data D
- P(h|D) = probability of h given D
- P(D|h) = probability of D given h

### Choosing Hypotheses:

Generally we want the most probable hypothesis given the training data, the maximum a posteriori (MAP) hypothesis h_MAP:

h_MAP = argmax_{h ∈ H} P(h|D) = argmax_{h ∈ H} P(D|h) P(h) / P(D) = argmax_{h ∈ H} P(D|h) P(h)

If we assume P(h_i) = P(h_j) for all i, j, we can simplify further and choose the maximum likelihood (ML) hypothesis:

h_ML = argmax_{h ∈ H} P(D|h)

Recall that both follow from Bayes theorem; P(D) can be dropped because it does not depend on h.

### An Example:

Does the patient have cancer or not? A patient takes a lab test and the result comes back positive. The test returns a correct positive result in only 98% of the cases in which the disease is actually present, and a correct negative result in only 97% of the cases in which the disease is not present. Furthermore, 0.008 of the entire population have this cancer.

- P(cancer) = 0.008, P(¬cancer) = 0.992
- P(+|cancer) = 0.98, P(−|cancer) = 0.02
- P(+|¬cancer) = 0.03, P(−|¬cancer) = 0.97
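As a check, the MAP comparison for this example can be run in a few lines of Python (a minimal sketch using only the probabilities listed above):

```python
# Probabilities from the cancer-test example above.
p_cancer, p_not_cancer = 0.008, 0.992
p_pos_given_cancer, p_pos_given_not = 0.98, 0.03

# Unnormalized posteriors P(+|h) * P(h) for each hypothesis.
joint_cancer = p_pos_given_cancer * p_cancer   # 0.98 * 0.008 = 0.00784
joint_not = p_pos_given_not * p_not_cancer     # 0.03 * 0.992 = 0.02976

# Normalizing gives the posterior probability of cancer given a positive test.
p_cancer_given_pos = joint_cancer / (joint_cancer + joint_not)
print(round(p_cancer_given_pos, 3))  # 0.209
```

So even after a positive test, h_MAP = ¬cancer: the posterior probability of cancer is only about 0.21.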

### Basic Formulas for Probabilities:

Product rule: the probability P(A ∧ B) of a conjunction of two events A and B:

P(A ∧ B) = P(A|B) P(B) = P(B|A) P(A)

Sum rule: the probability of a disjunction of two events A and B:

P(A ∨ B) = P(A) + P(B) − P(A ∧ B)

Theorem of total probability: if events A_1, …, A_n are mutually exclusive with Σ_{i=1}^n P(A_i) = 1, then

P(B) = Σ_{i=1}^n P(B|A_i) P(A_i)

### Brute Force MAP Hypothesis Learner:

For each hypothesis h in H, calculate the posterior probability

P(h|D) = P(D|h) P(h) / P(D)

Output the hypothesis h_MAP with the highest posterior probability:

h_MAP = argmax_{h ∈ H} P(h|D)
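A sketch of this brute-force learner in Python; the coin-bias hypotheses and helper names below are illustrative, not from the slides:

```python
# Brute-force MAP learner: score every hypothesis by prior * likelihood
# and return the argmax. (P(D) is constant over h, so it can be ignored.)
def brute_force_map(hypotheses, prior, likelihood, data):
    return max(hypotheses, key=lambda h: prior(h) * likelihood(data, h))

# Toy usage: H = two coin-bias hypotheses, D = three heads in a row.
hypotheses = [0.5, 0.9]                      # P(heads) under each hypothesis
prior = lambda h: 0.5                        # uniform prior over H
likelihood = lambda d, h: h ** d.count("H") * (1 - h) ** d.count("T")
print(brute_force_map(hypotheses, prior, likelihood, "HHH"))  # 0.9
```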

### Relation to Concept Learning:

Consider our usual concept learning task:

- instance space X, hypothesis space H, training examples D
- consider the Find-S learning algorithm (outputs the most specific hypothesis from the version space VS_{H,D})

Questions:

1. What would Bayes rule produce as the MAP hypothesis?
2. Does Find-S output a MAP hypothesis?

### Relation to Concept Learning:

- Assume a fixed set of instances ⟨x_1, …, x_m⟩.
- Assume D is the set of classifications D = ⟨c(x_1), …, c(x_m)⟩.
- Choose P(D|h):
  - P(D|h) = 1 if h is consistent with D
  - P(D|h) = 0 otherwise
- Choose P(h) to be the uniform distribution: P(h) = 1/|H| for all h in H.

Then:

P(h|D) = 1/|VS_{H,D}| if h is consistent with D, and P(h|D) = 0 otherwise

### Evolution of Posterior Probabilities:


### Characterizing Learning Algorithms by Equivalent MAP Learners:

The definitions of P(h) and P(D|h) characterize the implicit assumptions of the Candidate-Elimination and Find-S algorithms.

### Maximum Likelihood and Least-squared Error Hypotheses:

Consider any real-valued target function f, with training examples ⟨x_i, d_i⟩, where d_i is a noisy training value:

d_i = f(x_i) + e_i

and e_i is a random variable (noise) drawn independently for each x_i according to some Gaussian distribution with mean 0. Then the maximum likelihood hypothesis h_ML is the one that minimizes the sum of squared errors:

h_ML = argmin_{h ∈ H} Σ_{i=1}^m (d_i − h(x_i))²

### Maximum Likelihood and Least-squared Error Hypotheses (Cont’d):

Starting from the likelihood of the training values,

h_ML = argmax_{h ∈ H} p(D|h) = argmax_{h ∈ H} ∏_{i=1}^m (1/√(2πσ²)) exp(−(d_i − h(x_i))² / (2σ²))

maximize the natural log of this instead:

h_ML = argmax_{h ∈ H} Σ_{i=1}^m [ −(1/2) ln(2πσ²) − (d_i − h(x_i))² / (2σ²) ]

The first term and the factor 1/(2σ²) do not depend on h, so

h_ML = argmin_{h ∈ H} Σ_{i=1}^m (d_i − h(x_i))²
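A quick numeric illustration of this equivalence in Python; the linear model, noise level, and seed are my own choices, not from the slides:

```python
import random

# Noisy samples of a target f(x) = 2x with Gaussian noise (mean 0).
random.seed(0)
xs = [i / 10 for i in range(1, 101)]
ds = [2.0 * x + random.gauss(0, 0.1) for x in xs]

# For a one-parameter model h(x) = w*x, the least-squares (and hence
# maximum likelihood) weight has the closed form w = sum(x*d) / sum(x*x).
w = sum(x * d for x, d in zip(xs, ds)) / sum(x * x for x in xs)
print(round(w, 2))  # recovers a weight close to 2.0
```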

### Minimum Description Length Principle:

Occam's razor: prefer the shortest hypothesis. The minimum description length (MDL) principle: prefer the hypothesis h that minimizes

h_MDL = argmin_{h ∈ H} [ L_C1(h) + L_C2(D|h) ]

where L_C(x) is the description length of x under encoding C.

Example: H = decision trees, D = training data labels.

- L_C1(h) is the number of bits to describe tree h
- L_C2(D|h) is the number of bits to describe D given h
- Note that L_C2(D|h) = 0 if the examples are classified perfectly by h; we need only describe the exceptions.

Hence h_MDL trades off tree size for training errors.

### Minimum Description Length Principle:

An interesting fact from information theory: the optimal (shortest expected coding length) code for an event with probability p is −log₂ p bits. Recall that

h_MAP = argmax_{h ∈ H} P(D|h) P(h) = argmin_{h ∈ H} [ −log₂ P(D|h) − log₂ P(h) ]   (1)

So interpret (1):

- −log₂ P(h) is the length of h under the optimal code
- −log₂ P(D|h) is the length of D given h under the optimal code

Prefer the hypothesis that minimizes length(h) + length(misclassifications).
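The coding-length fact is easy to verify numerically (the example probabilities are mine):

```python
import math

# Optimal expected code length for an event with probability p is -log2(p)
# bits: likely events are cheap to encode, rare events are expensive.
for p in [0.5, 0.25, 0.01]:
    print(p, -math.log2(p))
```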

### Most Probable Classification of New Instances:

So far we have sought the most probable hypothesis given the data D (i.e., h_MAP). Given a new instance x, what is its most probable classification? h_MAP(x) is not the most probable classification! Consider three possible hypotheses:

P(h_1|D) = .4, P(h_2|D) = .3, P(h_3|D) = .3

Given a new instance x: h_1(x) = +, h_2(x) = −, h_3(x) = −.

What is the most probable classification of x?

### Bayes Optimal Classifier:

Bayes optimal classification:

v_OB = argmax_{v_j ∈ V} Σ_{h_i ∈ H} P(v_j|h_i) P(h_i|D)

Example:

- P(h_1|D) = .4, P(−|h_1) = 0, P(+|h_1) = 1
- P(h_2|D) = .3, P(−|h_2) = 1, P(+|h_2) = 0
- P(h_3|D) = .3, P(−|h_3) = 1, P(+|h_3) = 0

Therefore

Σ_{h_i ∈ H} P(+|h_i) P(h_i|D) = .4 and Σ_{h_i ∈ H} P(−|h_i) P(h_i|D) = .6

and v_OB = −.
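The weighted vote above is short enough to sketch directly in Python (the hypothesis names are illustrative):

```python
# Bayes optimal classification: weight each hypothesis's predicted label
# by its posterior P(h|D), then pick the label with the largest total.
posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
predictions = {"h1": "+", "h2": "-", "h3": "-"}

scores = {"+": 0.0, "-": 0.0}
for h, p in posteriors.items():
    scores[predictions[h]] += p

v_ob = max(scores, key=scores.get)
print(v_ob, scores)  # "-" wins, 0.6 to 0.4
```

Note the contrast with h_MAP = h_1, which would have predicted +.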

### Gibbs Classifier:

The Bayes optimal classifier provides the best result, but it can be expensive if there are many hypotheses.

Gibbs algorithm:

1. Choose one hypothesis at random, according to P(h|D).
2. Use this to classify the new instance.

Surprising fact: assume the target concepts are drawn at random from H according to the priors on H. Then

E[error_Gibbs] ≤ 2 E[error_BayesOptimal]

So, given a correct, uniform prior distribution over H: pick any hypothesis from VS with uniform probability; its expected error is no worse than twice Bayes optimal.

### Naive Bayes Classifier:

Learning task:

- Each instance x is described by a conjunction of attribute values, and the target function f(x) can take on any value from some finite set V.
- A set of training examples of the target function is provided, and a new instance is presented, described by the tuple of attribute values ⟨a_1, a_2, …, a_n⟩.
- The learner is asked to predict the target value, or classification, for this new instance.

### Naive Bayes Classifier (Cont’d):

Assume a target function f : X → V, where each instance x is described by attributes ⟨a_1, a_2, …, a_n⟩. The most probable value of f(x) is:

v_MAP = argmax_{v_j ∈ V} P(v_j | a_1, a_2, …, a_n) = argmax_{v_j ∈ V} P(a_1, a_2, …, a_n | v_j) P(v_j)

Naive Bayes assumption:

P(a_1, a_2, …, a_n | v_j) = ∏_i P(a_i | v_j)

which gives the naive Bayes classifier:

v_NB = argmax_{v_j ∈ V} P(v_j) ∏_i P(a_i | v_j)

### Naive Bayes Algorithm:

Naive_Bayes_Learn(examples):

- For each target value v_j:
  - P̂(v_j) ← estimate P(v_j)
  - For each attribute value a_i of each attribute a:
    - P̂(a_i | v_j) ← estimate P(a_i | v_j)

Classify_New_Instance(x):

v_NB = argmax_{v_j ∈ V} P̂(v_j) ∏_{a_i ∈ x} P̂(a_i | v_j)

### Naive Bayes Example:

| Day | Outlook | Temperature | Humidity | Wind | PlayTennis |
|-----|---------|-------------|----------|------|------------|
| D1 | Sunny | Hot | High | Weak | No |
| D2 | Sunny | Hot | High | Strong | No |
| D3 | Overcast | Hot | High | Weak | Yes |
| D4 | Rain | Mild | High | Weak | Yes |
| D5 | Rain | Cool | Normal | Weak | Yes |
| D6 | Rain | Cool | Normal | Strong | No |
| D7 | Overcast | Cool | Normal | Strong | Yes |
| D8 | Sunny | Mild | High | Weak | No |
| D9 | Sunny | Cool | Normal | Weak | Yes |
| D10 | Rain | Mild | Normal | Weak | Yes |
| D11 | Sunny | Mild | Normal | Strong | Yes |
| D12 | Overcast | Mild | High | Strong | Yes |
| D13 | Overcast | Hot | Normal | Weak | Yes |
| D14 | Rain | Mild | High | Strong | No |

### Naive Bayes Example (Cont’d):

Consider PlayTennis again, and the new instance

⟨Outlook = Sunny, Temp = Cool, Humid = High, Wind = Strong⟩

We want to compute v_NB = argmax_{v_j ∈ V} P(v_j) ∏_i P(a_i | v_j):

- P(y) P(sun|y) P(cool|y) P(high|y) P(strong|y) = .005
- P(n) P(sun|n) P(cool|n) P(high|n) P(strong|n) = .021

⇒ v_NB = n
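This calculation can be reproduced directly from the table (a sketch; the estimates are the simple relative frequencies the slides use):

```python
# PlayTennis rows from the table above: (Outlook, Temperature, Humidity, Wind, Play).
data = [
    ("Sunny", "Hot", "High", "Weak", "No"), ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"), ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"), ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"), ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"), ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"), ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"), ("Rain", "Mild", "High", "Strong", "No"),
]

def nb_score(instance, label):
    """P(v) times the product of P(a_i|v), estimated by relative frequencies."""
    rows = [r for r in data if r[-1] == label]
    score = len(rows) / len(data)
    for i, value in enumerate(instance):
        score *= sum(r[i] == value for r in rows) / len(rows)
    return score

x = ("Sunny", "Cool", "High", "Strong")
print(round(nb_score(x, "Yes"), 3), round(nb_score(x, "No"), 3))  # 0.005 0.021
```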

### Bayesian Belief Networks:

Bayesian belief networks are interesting because:

- the naive Bayes assumption of conditional independence is too restrictive;
- but inference is intractable without some such independence assumptions.

Bayesian belief networks describe conditional independence among subsets of variables.

### Belief Network:

A belief network goes by many other names:

- Bayesian network
- Probabilistic network
- Causal network
- Knowledge map

A belief network is a graph that represents the dependence between variables and gives a concise specification of the joint probability distribution.

### Topology of Belief Network:

- A set of random variables makes up the nodes of the network.
- A set of directed links or arrows connects pairs of nodes. The intuitive meaning of an arrow from node X to node Y is that X has a direct influence on Y.
- Each node has a conditional probability table (CPT) that quantifies the effects the parents have on the node. The parents of a node are all those nodes that have arrows pointing to it.
- The graph has no directed cycles (hence it is a directed acyclic graph, or DAG).

### An Example of Belief Network:

A burglar alarm is installed that is fairly reliable at detecting a burglary, but also responds on occasion to minor earthquakes. There are two neighbors, John and Mary, who have promised to call you at work when they hear the alarm. John always calls when he hears the alarm, but sometimes confuses the telephone ringing with the alarm and calls then, too. Mary likes rather loud music and sometimes misses the alarm altogether. Given the evidence of who has or has not called, we would like to estimate the probability of a burglary.

### Topology of Belief Network (Cont’d):

The topology of a belief network represents the general structure of the causal processes in the domain. Do we need to add nodes corresponding to Mary currently listening to loud music, or to the telephone ringing and confusing John?

### Conditional Independence Relations in Belief Networks:

Direction-dependent separation (d-separation): a set of nodes E d-separates two sets of nodes X and Y if every undirected path from a node in X to a node in Y is blocked given E. If every undirected path from a node in X to a node in Y is d-separated by E, then X and Y are conditionally independent given E.

### Conditional Independence Relations in Belief Networks (Cont’d):

A path is blocked given E if there is a node Z on the path for which one of three conditions holds:

1. Z is in E and Z has one arrow on the path leading in and one arrow leading out.
2. Z is in E and Z has both path arrows leading out.
3. Neither Z nor any descendant of Z is in E, and both path arrows lead into Z.

Given the evidence E, if every path from X to Y is blocked, then we say that E d-separates X and Y.

### Specification of Conditional Probability Table (CPT):

Once the topology of the belief network is specified, we need to specify a conditional probability table (CPT) for each node, e.g. for Alarm:

| Burglary | Earthquake | P(Alarm = True) | P(Alarm = False) |
|----------|------------|-----------------|------------------|
| True | True | 0.950 | 0.050 |
| True | False | 0.950 | 0.050 |
| False | True | 0.290 | 0.710 |
| False | False | 0.001 | 0.999 |

### An Example of Belief Network with Conditional Probabilities:


### Belief Network and Joint Probability Distribution:

A belief network is a representation of the joint probability distribution; the CPTs provide a decomposed representation of the joint:

P(x_1, …, x_n) = ∏_{i=1}^n P(x_i | Parents(X_i))   (1)

For example, we can calculate the probability of the event that the alarm has sounded but neither a burglary nor an earthquake has occurred, and both John and Mary call:

P(J ∧ M ∧ A ∧ ¬B ∧ ¬E) = P(J|A) P(M|A) P(A|¬B ∧ ¬E) P(¬B) P(¬E)

Hence, a belief network can be used to answer any query.
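A numeric sketch of this product in Python. The alarm row P(A|¬B,¬E) = 0.001 comes from the CPT above; the remaining numbers (the priors and the John/Mary call probabilities) are assumed illustrative values, since the transcript omits the full network figure:

```python
# P(J, M, A, ~B, ~E) = P(J|A) P(M|A) P(A|~B,~E) P(~B) P(~E)
p_b = 0.001        # assumed prior P(Burglary)
p_e = 0.002        # assumed prior P(Earthquake)
p_a_nb_ne = 0.001  # P(Alarm | ~Burglary, ~Earthquake), from the CPT above
p_j_a = 0.90       # assumed P(JohnCalls | Alarm)
p_m_a = 0.70       # assumed P(MaryCalls | Alarm)

p = p_j_a * p_m_a * p_a_nb_ne * (1 - p_b) * (1 - p_e)
print(f"{p:.6f}")  # a small probability, about 0.00063
```

Five CPT entries replace a full joint over five Boolean variables; that is the decomposition at work.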

### A Method for Constructing Belief Networks:

First, we rewrite the joint in terms of conditional probabilities, using the definition of conditional probability repeatedly (the chain rule):

P(x_1, …, x_n) = P(x_n | x_{n−1}, …, x_1) ⋯ P(x_2 | x_1) P(x_1) = ∏_{i=1}^n P(x_i | x_{i−1}, …, x_1)

Compared to eq. (1), we see that the specification of the joint is equivalent to the general assertion that

P(X_i | X_{i−1}, …, X_1) = P(X_i | Parents(X_i))

provided that Parents(X_i) ⊆ {X_{i−1}, X_{i−2}, …, X_1}.

### A Method for Constructing Belief Networks (Cont’d):

The belief network is a correct representation of the domain only if each node is conditionally independent of its predecessors in the node ordering, given its parents. In order to construct a belief network with the correct structure for the domain, we need to choose parents for each node such that this property holds. Intuitively, the parents of node X_i should contain all those nodes in X_1, X_2, …, X_{i−1} that directly influence X_i.

### General Procedure for Constructing Belief Networks:

1. Choose the set of relevant variables X_i that describe the domain.
2. Choose an ordering for the variables.
3. While there are variables left:
   1. Pick a variable X_i and add a node to the network for it.
   2. Set Parents(X_i) to some minimal set of nodes already in the net such that the conditional independence property is satisfied.
   3. Define the conditional probability table for X_i.

Does this procedure guarantee that the network is acyclic?

### Compactness and Node Ordering:

A belief network can often be far more compact than the full joint. The compactness of belief networks is an example of a very general property of locally structured (also called sparse) systems. In a locally structured system, each subcomponent interacts directly with only a bounded number of other components, regardless of the total number of components. E.g., for n Boolean variables with at most k parents each, a belief network needs at most n·2^k numbers, versus 2^n for the full joint.
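The n·2^k versus 2^n gap is easy to make concrete (the particular n and k are arbitrary illustrative values):

```python
# Parameter counts for n Boolean variables, each with at most k parents:
# belief network: each CPT has at most 2**k rows, so at most n * 2**k numbers;
# full joint distribution: 2**n entries.
n, k = 30, 5
network_params = n * 2 ** k
full_joint = 2 ** n
print(network_params, full_joint)  # 960 vs 1073741824
```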

### Compactness and Node Ordering (Cont’d):

Constructing a locally structured belief network is not a trivial problem:

- We require that each variable is directly influenced by only a few others;
- The network topology must actually reflect those direct influences, with the appropriate set of parents.

Therefore, the correct order in which to add nodes is to add the "root causes" first, then the variables they influence, and so on until we reach the "leaves", which have no direct causal influence on the other variables.

### What happens if we happen to choose the wrong order?:

Two major drawbacks:

- More links in the network;
- Some of the links represent tenuous relationships that require difficult and unnatural probability judgments.

### Expectation Maximization (EM):

When to use:

- Data is only partially observable
- Unsupervised clustering (target value unobservable)
- Supervised learning (some instance attributes unobservable)

Some uses:

- Training Bayesian belief networks
- Unsupervised clustering (AUTOCLASS)
- Learning hidden Markov models

### General EM Problem:

Given:

- Observed data X = {x_1, …, x_N}
- Unobserved data Z = {z_1, …, z_N}
- A parameterized probability distribution P(Y|θ), where Y = {y_1, …, y_N} is the full data with y_i = x_i ∪ z_i, and θ are the parameters

Determine: the θ that (locally) maximizes E[ln P(Y|θ)]

### EM Algorithm:

Suppose N observations x_1, x_2, …, x_N are independently and identically distributed (iid) from the following finite-mixture model:

p(x | Θ) = Σ_{j=1}^k α_j p(x | θ_j)

We let Θ = {α_1, …, α_k, θ_1, …, θ_k} with Σ_{j=1}^k α_j = 1 and α_j ≥ 0.

### EM Algorithm (Cont’d):

Assume that z_1, z_2, …, z_N are iid according to a multinomial distribution consisting of one draw on the k densities with probabilities α_1, α_2, …, α_k, respectively. We then have

P(z_t = j) = α_j, for j = 1, …, k.

We further assume that x_1, x_2, …, x_N are conditionally independent given z_1, z_2, …, z_N, respectively. Subsequently, we have

p(x_t | z_t; Θ) = p(x_t | θ_{z_t})

### EM Algorithm (Cont’d):

Consequently, the joint probability density function of the complete data is

p(Y | Θ) = ∏_{t=1}^N α_{z_t} p(x_t | θ_{z_t}) = ∏_{t=1}^N ∏_{j=1}^k [α_j p(x_t | θ_j)]^{z_{jt}}

where z_{jt} = 1 if x_t was drawn from the j-th density and z_{jt} = 0 otherwise. Hence, the log likelihood of Θ is:

L(Y; Θ) = Σ_{t=1}^N Σ_{j=1}^k z_{jt} ln[α_j p(x_t | θ_j)]

The incomplete-data theory (Dempster et al., 1977) treats z_{jt} as missing data and replaces it by its expected value conditioned on x_t, i.e., E(z_{jt} | x_t; Θ), which is actually the posterior probability that x_t belongs to the j-th density p(x_t | θ_j).

### EM Algorithm (Cont’d):

Hence, we finally have

L(Y; Θ) = Σ_{t=1}^N Σ_{j=1}^k E(z_{jt} | x_t; Θ) ln[α_j p(x_t | θ_j)]

EM learning maximizes L(Y; Θ) in two alternating steps:

- Expectation (E)-step: given each input x_t, fix Θ, and calculate

  E(z_{jt} | x_t; Θ) = α_j p(x_t | θ_j) / Σ_{i=1}^k α_i p(x_t | θ_i)

- Maximization (M)-step: fix E(z_{jt} | x_t; Θ), and update Θ to maximize L(Y; Θ).
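The two steps can be sketched for a two-component 1-D Gaussian mixture in plain Python (the data, initialization, and component count are my own illustrative choices, not from the slides):

```python
import math
import random

# Synthetic data: two Gaussian clusters with means 0 and 5, unit variance.
random.seed(1)
xs = [random.gauss(0.0, 1.0) for _ in range(200)] + \
     [random.gauss(5.0, 1.0) for _ in range(200)]

def gauss_pdf(x, m, s):
    return math.exp(-(x - m) ** 2 / (2 * s * s)) / (s * math.sqrt(2 * math.pi))

alpha = [0.5, 0.5]   # mixing weights alpha_j
mu = [-1.0, 6.0]     # initial component means
sigma = [1.0, 1.0]   # initial component standard deviations

for _ in range(30):
    # E-step: h[j][t] = E(z_jt | x_t; Theta), the posterior that x_t came from j.
    h = [[0.0] * len(xs) for _ in range(2)]
    for t, x in enumerate(xs):
        dens = [alpha[j] * gauss_pdf(x, mu[j], sigma[j]) for j in range(2)]
        total = sum(dens)
        for j in range(2):
            h[j][t] = dens[j] / total
    # M-step: closed-form updates that maximize the expected log likelihood.
    for j in range(2):
        nj = sum(h[j])
        alpha[j] = nj / len(xs)
        mu[j] = sum(h[j][t] * xs[t] for t in range(len(xs))) / nj
        sigma[j] = math.sqrt(sum(h[j][t] * (xs[t] - mu[j]) ** 2
                                 for t in range(len(xs))) / nj)

print([round(m, 1) for m in mu])  # means recovered near 0 and 5
```

Each iteration never decreases the likelihood, which is why EM converges to a local maximum of L(Y; Θ).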