AI class unit 4

These are my notes for unit 4 of the AI class.

Probabilistic Inference

Overview and Example

{{#ev:youtubehd|1fVWQ-iZqsw}}

Welcome back. In the previous unit, we went over the basics of probability theory and saw how a Bayes network could concisely represent a joint probability distribution, including the representation of independence between the variables. In this unit, we will see how to do probabilistic inference. That is, how to answer probability questions using Bayes nets.

Let's put up a simple Bayes net. We'll use the familiar example of the earthquake where we can have a burglary (B) or an earthquake (E) setting off an alarm (A), and if the alarm goes off, either John (J) or Mary (M) might call.

Now, what kinds of questions can we ask a Bayes net to answer? The simplest type of question is the same one we ask of an ordinary subroutine or function in a programming language: namely, given some inputs, what are the outputs? So, in this case, we could say: given the inputs B and E, what are the outputs J and M?

Rather than calling them input and output variables, in probabilistic inference we'll call them evidence and query variables. That is, the variables whose values we know are the evidence, and the ones whose values we want to find out are the query variables. Anything that is neither evidence nor query is known as a hidden variable: we aren't told its value, and we won't figure out its value and report it, but we may have to compute with it internally.

And furthermore, in probabilistic inference the output is not a single number for each of the query variables; rather, it's a probability distribution. So the answer is going to be a complete joint probability distribution over the query variables. We call this the posterior distribution given the evidence, and we can write it like this:
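
P(Q1, Q2, … | E1=e1, E2=e2, …)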

It's the probability distribution of one or more query variables given the values of the evidence variables. There can be zero or more evidence variables, and each of them is given an exact value. That's the computation we want to come up with.

There's another question we can ask: which is the most likely explanation? That is, out of all the possible values for all the query variables, which combination of values has the highest probability?

We write the formula like this:
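
argmax over q1, q2, … of P(Q1=q1, Q2=q2, … | E1=e1, E2=e2, …)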

asking which Q values are maximal given the evidence values. Now, in an ordinary programming language, each function goes only one way. It has input variables, does some computation, and comes up with a result variable or result variables. One great thing about Bayes nets is that we're not restricted to going only in one direction. We could go in the causal direction, giving as evidence the root nodes of the tree and asking as query values the nodes at the bottom. Or, we could reverse that causal flow. For example, we could have J and M be the evidence variables and B and E be the query variables, or we could have any other combination. For example, we could have M be the evidence variable and J and B be the query variables.

Here's a question for you. Imagine the situation where Mary has called to report that the alarm is going off, and we want to know whether or not there has been a burglary. For each of the nodes, click on the circle to tell us if the node is an evidence node, a hidden node, or a query node.

{{#ev:youtubehd|VYsys0If8bw}}

The answer is that Mary calling is the evidence node. The burglary is the query node, and all the others are hidden variables in this case.

Enumeration

{{#ev:youtubehd|VYsys0If8bw}}

Now we're going to talk about how to do inference on Bayes nets. We'll start with our familiar network, and we'll talk about a method called enumeration, which goes through all the possibilities, adds them up, and comes up with an answer.

So, what we do is start by stating the problem. We're going to ask the question: what is the probability of a burglary, given that John called and Mary called? We'll use the definition of conditional probability to answer this.

So, this query is equal to the joint probability distribution of all three variables divided by the probability of the evidence: P(+b | +j, +m) = P(+b, +j, +m) / P(+j, +m). Now, note that I'm using a notation here where, instead of writing out that some variable equals true, I just use a plus sign and then the variable name in lower case; if I wanted the negation, I would use the negation sign (¬). There's a different notation where, instead of writing the plus and negation signs, we use the bare variable name, P(e), to indicate that E is true.

That notation works well, but it can be ambiguous: does P(e) mean that E is true, or that E is a variable? So we're going to stick to the notation with explicit plus and negation signs.

To do inference by enumeration, we first take the conditional probability and rewrite it in terms of unconditional probabilities. Then we enumerate all the atomic probabilities and calculate the sum of products. Let's look at just the complex term in the numerator first. The procedure for figuring out the denominator would be similar, so we'll skip it.

So, the probability of these three terms together can be determined by enumerating all possible values of the hidden variables. In this case there are two: E and A. So we'll sum over those variables, for all values of E and for all values of A. They're boolean, so there are only two values of each. We ask: what's the probability of this unconditional term? And we get that by summing it out over all possibilities of E and A being true or false.
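
P(+b, +j, +m) = Σ_e Σ_a P(+b, +j, +m, e, a)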

Now, to get the values of these atomic events, we'll have to rewrite this equation in a form that corresponds to the conditional probability tables that we have associated with the Bayes net. So, we'll take this whole expression and rewrite it. It's still a sum over the hidden variables E and A, but now I'll rewrite this expression in terms of the parents of each of the nodes in the network. So, that gives us the product of these five terms, which we then have to sum over all values of E and A.
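
P(+b, +j, +m) = Σ_e Σ_a P(+b) P(e) P(a | +b, e) P(+j | a) P(+m | a)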

If we call this product f(e,a), then the whole answer is the sum of f(e,a) over all values of E and A: a sum of four terms, where each term is a product of five numbers.

Where do we get the numbers to fill in this equation?

From the conditional probability tables of our model. So let's put the equation back up, and we'll ask you, for the case where both E and A are positive, to look up the numbers for each of these five terms in the conditional probability tables, then multiply them together and fill in the product.

{{#ev:youtubehd|fxYL4PIBXiY}}

We get the answer by reading numbers off the conditional probability tables. The probability of B being positive is 0.001. The probability of E being positive, because we're dealing with the positive case for the variable E, is 0.002. The probability of A being positive, given that B is positive and E is positive, we can read off as 0.95. The probability that J is positive given that A is positive is 0.9. And finally, the probability that M is positive given that A is positive is 0.7. Multiplying all those together gives a small number, because of the 0.001 and 0.002 factors: it works out to 0.000001197. That seems like a really small number, but remember, we have to normalize by the P(+j, +m) term, and this is only one of the four possibilities; we have to enumerate over all four combinations of E and A. In the end, it works out that the probability of a burglary, given that both John and Mary called, is 0.284.

And we get that number because, intuitively, the alarm is fairly reliable and John and Mary calling are very reliable indicators, but the prior probability of burglary is low. Those factors combine to give us the 0.284 value when we sum up each of the four terms of these products.
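
To make this concrete, here is a minimal Python sketch of inference by enumeration on this network. The CPT entries that weren't read out above (for example, P(+a | +b, ¬e) = 0.94) are assumed from the standard textbook version of the alarm network, so treat them as illustrative.

<pre>
# Inference by enumeration on the alarm network: P(+b | +j, +m).
# CPT entries not quoted in the lecture are assumed from the
# standard textbook version of this network.
P_B = {True: 0.001, False: 0.999}                  # P(B)
P_E = {True: 0.002, False: 0.998}                  # P(E)
P_A = {(True, True): 0.95, (True, False): 0.94,    # P(+a | B, E)
       (False, True): 0.29, (False, False): 0.001}
P_J = {True: 0.90, False: 0.05}                    # P(+j | A)
P_M = {True: 0.70, False: 0.01}                    # P(+m | A)

def joint(b, e, a, j, m):
    """Product of the five CPT factors for one atomic event."""
    pa = P_A[(b, e)] if a else 1 - P_A[(b, e)]
    pj = P_J[a] if j else 1 - P_J[a]
    pm = P_M[a] if m else 1 - P_M[a]
    return P_B[b] * P_E[e] * pa * pj * pm

bools = (True, False)
# Numerator P(+b, +j, +m): sum out the hidden variables E and A.
num = sum(joint(True, e, a, True, True) for e in bools for a in bools)
# Denominator P(+j, +m): additionally sum out B.
den = num + sum(joint(False, e, a, True, True) for e in bools for a in bools)
print(round(num / den, 3))  # prints 0.284, matching the value above
</pre>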

Speeding Up Enumeration

{{#ev:youtubehd|DWO-XKo2iS8}}

We've seen how to do enumeration to solve the inference problem on belief networks. For a simple network like the alarm network, that's all we need to know. There are only five variables, so even if all five of them were hidden, there would be only 32 rows in the table to sum up. From a theoretical point of view, we're done. But from a practical point of view, other networks could give us trouble. Consider this network, which is one for determining insurance for car owners. There are 27 different variables. If each variable were boolean, that would give us over 100 million rows to sum out. But in fact some of the variables are non-boolean, they have multiple values, and it turns out that to represent this entire network and do enumeration, we'd have to sum over a quadrillion rows. That's just not practical, so we're going to have to come up with methods that are faster than enumerating everything.

The first technique we can use to get a speed-up in doing inference on Bayes nets is to pull out terms from the enumeration.

For example, here the probability of B is going to be the same for all values of E and A. So we can take that term and move it out of the summation, and now we have a little bit less work to do.

We can multiply by that term once rather than having it in each row of the table. We can also move the term P(e) to the left of the summation over A, because it doesn't depend on A.
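
With both terms pulled out, the numerator becomes:

P(+b, +j, +m) = P(+b) Σ_e P(e) Σ_a P(a | +b, e) P(+j | a) P(+m | a)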

By doing this, we're doing less work. The inner loop of the summation now has only three terms rather than five. So we've reduced the cost of doing each row of the table. But we still have the same number of rows in the table, so we're going to have to do better than that.

The next technique for efficient inference is to maximize the independence of variables. The structure of a Bayes net determines how efficient it is to do inference on it. For example, a network that's a linear string of variables, X1 through Xn, can have inference done in time proportional to n,

whereas a complete network, in which every node points to every other node, could take time on the order of 2^n if all n variables are boolean.

In the alarm network we saw previously, we took care to make sure that we had all the independence relations represented in the structure of the network. But if we put the nodes together in a different order, we would end up with a different structure. Let's start by ordering the node "John calls" first, and then adding the node "Mary calls". The question is: given just these two nodes, and looking at the node for "Mary calls", is that node dependent on or independent of the node for "John calls"?

{{#ev:youtubehd|r3mOvkvHbts}}

The answer is that the node for "Mary calls" in this network is dependent on "John calls". In the previous network they were independent, given that we knew the alarm had occurred. But here we don't know whether the alarm occurred, so the nodes are dependent: having information about one affects what we believe about the other.