Disclaimers in scientific plots and the D0 single top signal
December 16, 2006
Posted by dorigo in mathematics, physics, science.
How often do you see a plot that sends a deceiving message? To me it happens all the time. Maybe because I am a perfectionist, or because I worked for too long in CDF (where the blessing of a plot may involve ten iterations in which fonts, labels, colors, anything gets changed from the original version), or maybe because I have this nasty habit of trying to understand a plot when it is shown to me, rather than just looking at it… Whatever it is, when I am shown scientific data in a graph I always find something I would have done differently, or something to criticize. Mine is the typical professorial look, if you wish – only, I am not a professor yet.
Our job as experimental physicists is difficult per se, but we are trained to perform it well and we usually do. However, the presentation of scientific data is something we are not trained well enough in, despite it being one of the most important duties we have. If we make a discovery, or just a measurement, the data it is based upon is worth billions of dollars. So we should be very careful in the way we present our results.
Now, most of the time things are easy: one just needs to specify a few things carefully, and forbid the broadcasting of a bare result stripped of its qualifications. Say you measure top quark pair production at the Tevatron: you can quote a cross section of 6 ± 1 pb, and that is fine. But your measurement in fact depends on the assumed value of the top quark mass Mt. If the signal acceptance increases linearly with Mt, so that your measured cross section decreases by a tenth of a picobarn for every GeV above 175, then to be precise you should write it as sigma(tt) = [6 - 0.1 (Mt - 175 GeV)/GeV] ± 1 pb.
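To make the point concrete, here is a minimal Python sketch of a result quoted together with its Mt dependence. The numbers are the illustrative ones from the example above (6 pb at 175 GeV, a slope of -0.1 pb per GeV, an error of 1 pb); the function name `sigma_tt` and its parameters are made up for this illustration and do not come from any real analysis code.

```python
# Illustrative numbers from the example in the text, not a real measurement.
SIGMA_ERR_PB = 1.0  # total uncertainty, assumed independent of Mt


def sigma_tt(mt_gev, sigma_ref=6.0, slope=-0.1, mt_ref=175.0):
    """Central value of the cross section (pb) for an assumed top mass (GeV).

    sigma_ref is the value quoted at mt_ref; slope is in pb per GeV.
    """
    return sigma_ref + slope * (mt_gev - mt_ref)


# The same measurement, read at three different assumed top masses.
for mt in (170.0, 175.0, 180.0):
    print(f"Mt = {mt:.0f} GeV: sigma(tt) = {sigma_tt(mt):.1f} +- {SIGMA_ERR_PB:.0f} pb")
```

Quoting only the value at 175 GeV throws away the slope, which is exactly the piece of information one needs in order to compare or average this result with one evaluated at a different top mass.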
Such specifications, alas, are usually neglected for brevity. But brevity is a close friend to imperfection. You put out a paper with the measurement and with every explanation of the Mt dependence, and you think you have done your homework, but when you plot your result together with others, you omit the precious information. Soon your result is quoted everywhere without that Mt dependence, and averaged with other results which used another central value for the top mass in their evaluation. Imperfections add up quadratically… Entropy is everywhere.
Most failures to specify important details in the broadcasting of scientific data happen in plots; and plots also lend themselves to being deceiving in their own right. Take the plot on the left for instance: it shows three determinations of single top production by the D0 collaboration, compared with the theoretical predictions for the same process. All looks well and clean, but I have several reservations:
- The plot does not say that the three determinations are strongly (I estimate about 90%) correlated with each other. They look like three independent determinations, and by taking the points at face value one would say the theoretical model is probably underestimating the cross section. Wrong! The data is actually compatible with the model at the level of 1.1 sigma or so.
- The theory prediction is computed for a particular value of the top quark mass, and that value is indeed printed in the plot. Of course, the same applies to the experimental determinations, which (one hopes) have at least used the same top mass value as the theory band…
- The plot does not explain a critical point in the interpretation of the data: the fact that the expected sensitivity of the most sensitive of the three analyses (the one with the smallest error bars, the “DT” method, for decision trees) was 2.1 sigma, while the data showed a 3.4 sigma effect. That means this plot would hardly have been distributed if the number of observed signal events had fluctuated low rather than high (an experimental bias due to the procedure by which we decide to publish our results): with a low fluctuation the “evidence” would not have been such, and the plot would not have made it to your desk. So if you look at these points with error bars, be advised that they stood no chance of lying to the left of the blue band. It is only because they are to the right of the band that the plot reached you!
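To put a number on the first objection, here is a rough Python sketch of how much one gains by averaging three measurements when they are correlated. It assumes, purely for illustration, that the three determinations have equal errors and the same pairwise correlation; these are not the actual D0 uncertainties, and the function name `error_on_mean` is made up for the example.

```python
import math


def error_on_mean(sigma, n, rho):
    """Uncertainty on the plain average of n measurements.

    Assumes every measurement has the same error sigma and every pair
    has the same correlation coefficient rho (a simplifying assumption).
    """
    return sigma * math.sqrt((1.0 + (n - 1) * rho) / n)


independent = error_on_mean(1.0, 3, 0.0)  # what the eye assumes
correlated = error_on_mean(1.0, 3, 0.9)   # with the ~90% correlation estimated above

print(f"three independent points: {independent:.2f} x single error")
print(f"three 90%-correlated points: {correlated:.2f} x single error")
```

Under these assumptions three independent points would shrink the error to about 0.58 of a single measurement, while three points correlated at 90% only bring it down to about 0.97. In other words, the three points carry little more information than one of them alone.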
These details easily escape most observers, but they are important. In the case of the third objection above, there is little one could do to avoid the problem. Only by deciding, before looking at the data, that the analysis would be made public with a given amount of it could one be saved from the bias. But that is not what typically happens… And so we have to live with rare processes being first measured high, and then slowly going back to their real cross section. It happened with the top quark evidence in 1994 too, and that time it was CDF who did it… And let me predict here that when the Higgs is discovered, it will show an abnormally high cross section too! (In fact, it almost happened already, with the LEP II evidence…)
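The selection effect behind the third objection can be seen in a toy simulation, sketched here in Python. Everything in it is an assumption made for illustration: the measurement is taken as Gaussian with a constant error, the observed significance is approximated as measured value divided by error, the true cross section is set to 1 in arbitrary units, and a result is "published" only if it reaches 3 sigma. Only the 2.1 sigma expected sensitivity is taken from the text; the function name `simulate_published` is hypothetical.

```python
import random
import statistics


def simulate_published(n_experiments=100_000, expected_significance=2.1,
                       threshold=3.0, seed=1):
    """Return the measured values of the toy experiments that get published.

    The true cross section is 1 (arbitrary units), so the error that gives
    the stated expected significance is 1 / expected_significance.
    """
    rng = random.Random(seed)  # fixed seed so the toy is reproducible
    sigma = 1.0 / expected_significance
    published = []
    for _ in range(n_experiments):
        measured = rng.gauss(1.0, sigma)
        # Publish only if the observed significance passes the threshold.
        if measured / sigma >= threshold:
            published.append(measured)
    return published


published = simulate_published()
print(f"published fraction: {len(published) / 100_000:.2f}")
print(f"mean published cross section (true value = 1): {statistics.mean(published):.2f}")
```

In this toy, every published value is by construction at least 3/2.1, or about 1.43 times the true cross section, so none of the published points can lie below the true value. Fixing the dataset and the decision to publish before looking at the data would amount to removing the `if` from the loop.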