The talk slides I promised
June 8, 2006
Posted by dorigo in computers, mathematics, personal, physics, science.
OK, rather than burying them in a newer version of yesterday's post about my talk, I will post the slides I promised in a brand new post here.
I did discuss God's Choice algorithm here a few days ago. Let me do it graphically now, and at the end of this post I'll show a practical example of its application.
Imagine you have two kinds of events in your dataset. Let's call "background" the kind you want to get rid of, and "signal" the kind you'd like to enrich the dataset in.
Further imagine you know the details of the behavior of your background in a set of kinematical variables.
Now let's fish out of your dataset a handful of events - say 50 (5 balls in the graph here). You look at them and ask the question: do these smell of background?
You test the 50-event sample by comparing its distribution in the kinematic variables to your background model, and get back from the test a single number, which tells you the probability that those 50 events are background.
Then you throw the 50 events back in the dataset, and randomly fish out another set of 50. You get another probability, and you continue this way for a loooong time.
After you've done this long enough, the sample with the most "background-like" properties can be selected out and discarded.
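One round of this fishing can be sketched in Python. This is only a toy under stated assumptions, not the code behind the slides: the single variable, the standard-normal background model in `background_cdf`, the Kolmogorov-Smirnov distance used as the "background-likeness" score, and every function name are illustrative choices of this sketch. The test described above returns a probability; here a smaller distance simply stands in for a higher background probability.

```python
import math
import random

def background_cdf(x):
    """Assumed background model: a standard normal in one toy variable."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def ks_distance(values, cdf):
    """Kolmogorov-Smirnov distance between a sample and a model CDF.
    Smaller means the sample looks more like the model."""
    xs = sorted(values)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        d = max(d, abs(f - i / n), abs((i + 1) / n - f))
    return d

def most_background_like(dataset, subset_size, n_trials, rng):
    """Fish out n_trials random subsets (each one is thrown back before
    the next draw) and return the indices of the most background-like."""
    best_idx, best_d = None, float("inf")
    for _ in range(n_trials):
        idx = rng.sample(range(len(dataset)), subset_size)
        d = ks_distance([dataset[i] for i in idx], background_cdf)
        if d < best_d:
            best_idx, best_d = idx, d
    return best_idx, best_d

rng = random.Random(1)
# Toy dataset (hypothetical): 300 background events plus 100 "signal"
# events that follow a different distribution.
data = [rng.gauss(0.0, 1.0) for _ in range(300)]
data += [rng.gauss(2.0, 0.5) for _ in range(100)]

idx, d = most_background_like(data, subset_size=50, n_trials=200, rng=rng)
drop = set(idx)
remaining = [x for i, x in enumerate(data) if i not in drop]
print(len(remaining), "events left; best KS distance:", round(d, 3))
```

Note that the score is computed on the subset as a whole, never on single events, which is the point of the procedure.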

At that point, you can repeat the whole procedure… Your original dataset will contain 50 fewer events, but other things will be equal.
You continue fishing out 50-event samples, checking how well they model your understanding of the background in your dataset.
The most background-like set will again be preferentially made up of background events, so you can get rid of it again, and restart the whole procedure.

This thing can go on and on for as long as you want. The longer the procedure is continued, the smaller your original dataset will become, and the richer it will be, in a relative sense, in whatever other process is contained in there.
That is, you are removing background events that behave according to your background model. Whatever else is there will benefit from the removal.
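The repeated procedure can be sketched as a loop. Again this is a hedged toy, not the implementation used for the talk: the one-variable standard-normal background model, the Kolmogorov-Smirnov score, the toy signal shape, and the names `gods_choice` and `history` are all assumptions made for illustration. The true labels are carried along only for bookkeeping and are never used to decide what to discard.

```python
import math
import random

def background_cdf(x):
    # Assumed background model: standard normal in one toy variable.
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def ks_distance(values):
    # Largest gap between the sample's empirical CDF and the model CDF.
    xs = sorted(values)
    n = len(xs)
    return max(
        max(abs(background_cdf(x) - i / n), abs((i + 1) / n - background_cdf(x)))
        for i, x in enumerate(xs)
    )

def gods_choice(events, subset_size, n_trials, n_rounds, rng):
    """events is a list of (value, is_signal) pairs. Each round fishes out
    n_trials random subsets and discards the most background-like one."""
    events = list(events)
    history = []  # (events left, signal events left) after each round
    for _ in range(n_rounds):
        best = min(
            (rng.sample(range(len(events)), subset_size) for _ in range(n_trials)),
            key=lambda idx: ks_distance([events[i][0] for i in idx]),
        )
        drop = set(best)
        events = [e for i, e in enumerate(events) if i not in drop]
        n_sig = sum(1 for _, is_sig in events if is_sig)
        history.append((len(events), n_sig))
    return events, history

rng = random.Random(7)
# Hypothetical toy sample: 300 background and 100 signal events.
sample = [(rng.gauss(0.0, 1.0), False) for _ in range(300)]
sample += [(rng.gauss(2.0, 0.5), True) for _ in range(100)]

remaining, history = gods_choice(sample, subset_size=20, n_trials=100,
                                 n_rounds=5, rng=rng)
for n_left, n_sig in history:
    print(n_left, "events left, signal fraction", round(n_sig / n_left, 3))
```

Raising `n_trials` means more fishing before each discard; raising `n_rounds` shrinks the dataset further.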
This allows you to search for unknown processes without any need to make a hypothesis for their properties.
This is the final outcome of the procedure: you are left with fewer red balls in the dataset than you had at the beginning, but their relative frequency is higher than it was before.
What I stressed in red in the slide above is very important: the removal of background-like events was done without preferentially selecting the events that were individually the most background-like. What was removed were subsets that reflected the predicted behavior of the background as a whole, and this cannot affect the probability density function of the background that is left behind.
Now for a sample application. I took 200 events from a Monte Carlo simulation of top-antitop production and 200 events from a MC simulation of single top production (an electroweak process not yet observed at the Tevatron), and iteratively removed top pair production events (the "background") with God's Choice algorithm, by comparing several kinematic variables of 5-event samples with their predicted distributions.
The plot in the slide above is a "purity versus efficiency" graph. You start from the lower right corner, where purity is 0.5 (half top-antitop and half single top) and efficiency is 1.0 (you have not discarded any single top signal event yet).
Then several sets of points depart from the lower right corner. Let's take the black ones. As you start removing 5-event sets from the 400-event sample, you throw away a bit of signal each time - you move left - but more so background, and the purity slowly increases - you move up.
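The two axes can be made concrete with a little arithmetic. In the sketch below the function name and the composition of the first discarded set (4 background, 1 signal) are hypothetical, chosen only to show the direction in which a point moves; the starting counts are the 200 + 200 events of the example.

```python
def purity_and_efficiency(n_signal, n_background, n_signal_initial):
    """Purity: fraction of the remaining sample that is signal.
    Efficiency: fraction of the initial signal that is still there."""
    purity = n_signal / (n_signal + n_background)
    efficiency = n_signal / n_signal_initial
    return purity, efficiency

# Starting point: 200 single top (signal) and 200 top-antitop (background).
start = purity_and_efficiency(200, 200, 200)

# Hypothetical first discarded 5-event set: 4 background, 1 signal.
after = purity_and_efficiency(199, 196, 200)

print(start)  # purity 0.5, efficiency 1.0
print(after)  # purity slightly above 0.5, efficiency slightly below 1.0
```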
You can decide to stop whenever you want, but it would make little sense to go all the way to the highest purity point of the black set, since you would have very few single top events left in your sample (something you are not allowed to know in real life, but here we know what each event is and we can follow the real evolution of the dataset as we discard events).
The black curve is not satisfactory: we lose too many signal events. But there are other curves there. The difference between them is the amount of random fishing that was done before deciding to discard the most "background-like" 5-event set.
You can see that as you increase the sampling, your chance of discarding more background-like events increases, and so your purity (y axis) is better, for the same efficiency (x axis).
So the above is a proof-of-principle that the algorithm works. After a lot of subset removals, one can check the distributions of the variables used to test background consistency for the background remaining in the dataset, and one finds that the procedure has indeed left those variables unbiased.
That means that after you apply God's Choice algorithm to increase the purity, you can still use whatever other tool you prefer to further increase the signal-to-noise ratio, with the same variables that discriminate the background from the signal.
I think that is remarkable, and look forward to applying this procedure on a sample of real data containing some signal to extract…
Hi, an interesting technique! It assumes that the “signal” events are different from the background in a biased manner, I think. That is, they all have to be different in the same way. … or rather you have to be careful in picking the variables used to check backgroundness.
For example (trivial and contrived), suppose your “signal” events are evenly distributed around a mean slightly different from that of your background points, but the background has a higher variance. Your background “test” is to look for high-variance samples. You will end up preferentially taking samples with signal points on the far side of the background average. hmm… it would still work… somewhat …
Fun Post!
Hi Markk,
thank you for your comment.
I think the technique is most meaningful when you have a large set of variables, and you do not know a priori which one might reveal a deviation once the S/N is boosted. So you do the random fishing, and then check again… Something might crop up somewhere.
Cheers,
T.
Ahh so you would actually end up with several “enhanced” samples to work with. Almost like a genetic algorithm. Thanks.
No, I explained myself poorly. You keep fishing out the events that, as a set, are the most background-like. When you do this enough times, what is LEFT is signal enriched.
Cheers,
T.
Yes. You keep fishing out and discarding occurrences of ‘background’ until you’re left with a higher relative density of ‘signal.’
Thank you.