
God’s choice algorithm June 3, 2006

Posted by dorigo in mathematics, personal, physics, science.

Maynard Handley ( http://name99.org/blog99/ ) comments as follows on part three of my proceedings (see https://dorigo.wordpress.com/2006/05/25/proofread-my-paper-will-you-part-iii/ ):

I’m not in a position to comment on your paper. I would however like to see a further post on the constraints on the algorithms you are discussing. I am assuming (is this correct?) that these are not triggering algorithms, so they don’t have to run in nanoseconds on dedicated hardware, but run in non-real-time on events that made it past the triggers and were captured. Given this, is the only real constraint the data values that you have, or are there still compute time/memory constraints that limit you? In other words, do better (“perfect”) algorithms exist, but you have to work with limited compute and RAM budgets, and these mean you have to dumb down the algorithms?

The above remark stimulated me to discuss an algorithm I devised no less than 14 years ago. The code was written and tested back then, but CPU limitations made it difficult to study its performance on a real problem: the 1992 computing resources available to an undergrad were smaller than what your cellphone is running right now. And since I am preparing a talk for the CDF collaboration meeting titled "New tools for the top mass measurement", and I need ideas… I am happy to include that old idea in my talk, and to discuss it a little here beforehand.

The algorithm is called "God's choice". The name is descriptive, as will be clear in a moment.

Imagine you have a dataset made up of some signal you want to study and some background process. You of course cannot tell if each individual event is a signal event: if you could, you would then just pick the signal, and discard the rest, and all your physical measurements on the signal process would greatly benefit from the increased purity of the resulting sample.

There are several tools physicists use in high-energy experiments to increase signal purity: once a variable is found that has a different distribution for signal and background, one can apply a cut, separating events that are most likely signal from the rest; cutting on one variable at a time one can increase purity step by step. Another idea is to use several variables at once, by constructing a relative likelihood that an event characterized by a set of discriminating variables is signal or background: by cutting on the relative likelihood, one obtains a stronger separation between signal and background. A similar tool is a Neural Network: it also works in a multidimensional space, trying to "learn" how to discriminate events of one kind from events of another kind.
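To make the relative-likelihood idea concrete, here is a toy sketch in Python (the function name, the dictionary interface and the assumption that the discriminating variables can be treated as independent are just illustrative choices of mine, not anybody's actual analysis code):

```python
# Toy per-event relative likelihood built from one-dimensional signal and
# background densities, treating the variables as independent.
import numpy as np

def relative_likelihood(event, signal_pdfs, background_pdfs):
    """event: {variable: value}; *_pdfs: {variable: callable density}.
    Returns L_s / (L_s + L_b), which tends towards 1 for signal-like events."""
    ls = np.prod([signal_pdfs[v](x) for v, x in event.items()])
    lb = np.prod([background_pdfs[v](x) for v, x in event.items()])
    return ls / (ls + lb)

# A cut such as relative_likelihood(event, s_pdfs, b_pdfs) > 0.8 then selects
# a signal-enriched subsample.
```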

All the above methods assume that you know how your signal and background behave in the space of the discriminating variables. Not so the algorithm I am about to describe.

Imagine you only know your background well, and have an indication that a new physics signal is there, but you have no idea – or prefer to avoid making any assumption – about the behavior of that signal as a function of the several kinematical variables that usually characterize a physical process in your experimental setup.

What can you do to increase the purity of the unknown signal? If you were God, you could certainly pick all of the background events one by one, and discard them. The algorithm called "God's choice" does more or less the same thing: given a dataset of N events, it picks a small subset m<<N of them, and compares all their characteristics to those of the known background process, as a whole. That is, instead of picking events one by one and determining a single-event likelihood, the algorithm evaluates whether the sample of m events behaves according to expectations for a background sample.

The choice of m different events can be done in a gazillion different ways, if N is large and m is much smaller than N. Imagine a computer making a random choice, comparing the resulting set to the "background only" hypothesis, storing the result of the comparison and the identity of the m events, and then putting the m events back in the bag and fishing out another set. If the operation is repeated a billion times (say), a privileged set of m events can be identified which best matches the background hypothesis.

The best match could be defined by a global Kolmogorov distance between the distributions for the chosen set and the background shapes. Or a likelihood. Any statistical estimator would do.
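As an illustration, a minimal sketch of such an estimator in Python could look like this, assuming the events are stored as arrays with one column per kinematical variable (the function name and the choice of the worst per-variable two-sample Kolmogorov-Smirnov distance are mine, not something taken from the 1992 code):

```python
# Minimal sketch of the "best match" estimator: the largest two-sample
# Kolmogorov-Smirnov distance over all kinematical variables
# (smaller value = more background-like subset).
from scipy.stats import ks_2samp

def background_distance(subset, background_template):
    """subset, background_template: arrays of shape (n_events, n_variables)."""
    return max(
        ks_2samp(subset[:, v], background_template[:, v]).statistic
        for v in range(subset.shape[1])
    )
```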

At that point, the computer can take those "best match" events and set them aside from the sample of N. It then goes back to the bag of N-m events, and starts fishing out samples of m again, repeating the comparison with the background hypothesis. After another billion choices the procedure stops, the m best events are taken out of the bag, and the computer goes back to fishing m out of N-2m.
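Putting the pieces together, a sketch of the full loop could look as follows; it re-uses the background_distance helper from the previous snippet, and the subset size, the number of trials per round and the number of rounds are free parameters here (the "billion" trials of the text would obviously call for compiled code and a lot of CPU, as discussed below):

```python
# Sketch of the iterative "God's choice" removal loop (illustrative only;
# background_distance is the helper defined in the previous snippet).
import numpy as np

rng = np.random.default_rng(0)

def gods_choice(events, background_template, m, n_trials, n_rounds):
    """events, background_template: arrays of shape (n_events, n_variables).
    Each round tries n_trials random m-event subsets, removes the one that
    best matches the background, and after n_rounds rounds returns the
    surviving (hopefully signal-enriched) events."""
    remaining = events.copy()
    for _ in range(n_rounds):
        best_idx, best_dist = None, np.inf
        for _ in range(n_trials):
            idx = rng.choice(len(remaining), size=m, replace=False)
            dist = background_distance(remaining[idx], background_template)
            if dist < best_dist:
                best_dist, best_idx = dist, idx
        # set aside the most background-like subset and keep fishing
        remaining = np.delete(remaining, best_idx, axis=0)
    return remaining
```

A natural stopping rule, which the post does not spell out, would be to quit as soon as even the best-matching subset no longer looks like background according to the same estimator.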

What the algorithm is doing is extracting from the original sample events that globally behave as background, using the knowledge of how the background behaves in the discriminating kinematical variables. If a signal is there which behaves differently, it will be more likely to remain in the bag, and its fraction will increase as more and more subsets of m events are taken away.

The strong point of the algorithm is that since you are extracting samples of background according to the shape of the kinematical variables, you are not "biasing" the remaining background in the bag. You are not applying a cut, slicing the space, nor are you selectively choosing events with a high background probability: the background that remains in the bag is unbiased. So, once you have fished out enough of it, you can study what is left, and maybe a signal will become more apparent.

The whole point of God's choice algorithm is that one does not deal with events, but with event sets. A single event possesses a value for each of the considered variables used to characterize the physical processes, while a sample of events possesses a distribution of values. By cutting away samples, rather than single events, one manages to avoid biasing the remaining background, with the huge benefit of being able to model it despite having taken away a large portion of it.

One is not limited by the poor knowledge of the tails of the distributions. The limitations come from computing resources: one needs a lot of CPU to run the procedure efficiently on large samples.

I know that the algorithm works because I tested it six years ago. But I have not played with it any more since then… I now need a undergrad student to study it more. I am sure it would be a great tool to search for unknown new processes in a large dataset of new-physics-sensitive data, such as samples of vector bosons in CDF.

Comments

1. riqie arneberg - June 3, 2006

I now need a undergrad student to study

AN undergrad student……….lol

2. Pietro Vischia - June 3, 2006

Wow! That sounds interesting!!!

The “let’s fish out m events at random, repeat the fishing an arbitrary billion times, then take the best set and eliminate it from the data” reminds me (with the obvious differences) of the “let’s start with n variables, compute a hyperball, then step after step make the hyperball better, and after an arbitrary billion steps extract the correlations that have been found”. Is it roughly the same type of procedure, or am I mistaken (I can’t remember the details of Hyperballs very well now)?

In your part tests, how the m

3. Pietro Vischia - June 3, 2006

Hey, it is unfair! It is not written that there is a maximum length for the message to be displayed…

I continue:

In your past tests, how the m

4. Pietro Vischia - June 3, 2006

Uff, I understood… HTML tags…

In your past tests, how did the m<<N condition turn out numerically? In other words, let’s say that N=10^6. What is roughly the maximum m that the algorithm would accept as “satisfyingly small” to make things work?

5. dorigo - June 3, 2006

Hi Pietro,

sound question, but I have no idea! It depends on the number of variables you are testing, the accuracy of your background model, the amount of data you want to discard.

If you like, all computer-intensive algorithms have a common basis: sheer computing power used to extract order out of randomness.

Cheers,
T.

6. hari - March 22, 2007

Re: “What can you do to increase the purity of the unknown signal? If you were God, you could certainly pick all of the background events one by one, and discard them.”

Dorigo, the question is why would God want to separate the signal from the background events? If God decided to create the bag of mixed nuts in the first place, it is counter-intuitive for God to go back and separate them.

