Random Forests March 3, 2006Posted by dorigo in computers, internet, mathematics, physics, science.
Thanks to Riccardo Rando, a colleague working for the GLAST experiment (http://www-glast.stanford.edu/), I learned about an intriguing algorithm which uses several concepts that are dear to my practice of computationally intensive solutions to particle physics problems.
I remember reading an article on the italian version of Scientific American when I was a 13 year old boy. The article [P.Diaconis, B.Efron, Computer–intensive methods in statistics, Scientific American 248: 96-108], discussed new techniques for data analysis which were becoming popular as the unit price of megaflops was going down and personal computers were appearing on the scene.
The technique of Bootstrap, discussed in that article, struck me as a brilliant idea, and my young mind stored the information, hoping to be able to use it sometime in the future.
“Bootstrap” is a catchword taken from the phrase “to pull oneself up from one’s own bootstraps”. That is indeed what the technique does: it creates lots of statistical samples from an insufficient amount of starting data, by randomly picking individual entities from it and allowing multiple picks of the same entity.
I used bootstrapping techniques eight years ago for the computation of the significance of an interesting fluctuation in a mass spectrum I had the occasion of analyzing. I was then an internal referee of a controversial analysis in CDF during run I… Maybe I will write about that in another post. [UPDATE: it took six months, but I did it – see this post of January 2007]. However, it is thanks to Riccardo that I can play with the same concept again now.
The Random Forest algorithm is a classifier. It works by constructing classification trees for a training data sample by selecting subsamples with the bootstrap technique. Many trees (a forest) are created, and their collective analysis allows a classification which does not suffer from the usual sicknesses of similar methods.
For the sake of brevity my explanation of the algorithm stops here. The interested reader is encouraged to read more in the site by the original authors of the code (which is available there): http://www.stat.berkeley.edu/~breiman/RandomForests/
It remains to show a slide from last Wednesday’s talk at the CDF Top Mass working group meeting: