
Data preservation survey October 30, 2008

Posted by dorigo in internet, physics, science.

Today I found twenty free minutes to fill out a questionnaire about data preservation on the CERN web site.

I have always considered the issue of data archiving very important, even crucial, for the advancement of Science. I find it appalling that huge sums are invested in building and operating large particle physics experiments with no clear plan for what to do with the data once the experiments close down.

There are several reasons why one needs to ensure that data is preserved.

  1. First and foremost, we do not burn books. Why should we dispose of data files that took so much care, years of effort, and the work of thousands of people to put together?
  2. Old data is potentially crucial to confirm new results, to disprove others, or to compare with them. I hope I can give a very clear example of what I mean this evening, when I discuss a new result by CDF which is potentially groundbreaking, and which might be tested with older data from CDF itself as well as from other hadron collider experiments.
  3. Old data may be invaluable as a laboratory to train new scientists. The number of Ph.D. students working on LHC experiments who have never worked with real data is disturbing: can we train them on Monte Carlo simulations? Sure, but it is not the same thing, not really.
  4. The data is ultimately a world heritage. I maintain that it is not the property of this or that collaboration. The people lucky enough to have been given the privilege of analyzing data from high-energy physics experiments have done so thanks to the funds provided by whole countries. The data, being a form of distilled knowledge, are owned by the peoples, and I am sorry if I sound like a communist here. If you do not agree, it is you who look like a fool to me.

So, if you wish to provide your input on this important issue, why don't you take the time to visit the site http://cern.ch/data-preservation-survey ?



1. Marco - October 30, 2008

Uh, interesting, it’s on my todo list for today too… 🙂

I was wondering: will we be able to read the LHC data back in, say, 20 or 30 years? In ATLAS we are already experiencing problems with the test-beam data taken 4 years ago: new releases of the reconstruction software too often break backward compatibility. OK, there are alternatives to raw data, but even so it would not be an easy task. Books are definitely an easier business!

2. Andrea Giammanco - October 30, 2008

> will we be able to read back the LHC data in, say, 20 or 30 years?

Only if ROOT remains stable for 20-30 years. 😉
Since we have backward-compatibility issues even between consecutive releases, this is a problem.
I propose to dump everything in ASCII 🙂
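Half-joking or not, a plain-text dump with a self-describing header would at least stay readable with any tool, at the price of size and precision. A minimal sketch in Python; the field names and values here are invented for illustration, not any real experiment's format:

```python
# Write hypothetical event records to a self-describing ASCII file.
header = ["run", "event", "n_tracks", "missing_et_gev"]
events = [
    (152435, 1001, 12, 24.7),
    (152435, 1002, 8, 3.1),
]

with open("events.txt", "w") as f:
    # A commented header line documents the columns inside the file itself.
    f.write("# " + " ".join(header) + "\n")
    for ev in events:
        f.write(" ".join(str(x) for x in ev) + "\n")

# Reading it back thirty years later needs nothing but a text parser.
with open("events.txt") as f:
    rows = [line for line in f if not line.startswith("#")]
    restored = [tuple(float(x) for x in line.split()) for line in rows]

print(restored)
```

The point is not that ASCII is efficient (it is not), but that the format carries its own documentation and depends on no library surviving.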

The problem was debated in the ALEPH collaboration years ago, and there were several interesting proposals.
One was to store entire old computers somewhere (not only the disks, entire computers!), which anyone in the future would just have to switch on; a user-friendly interface would run automatically at logon and give instructions on how to access the data, kept on other disks stored in the same place.
(I don’t remember which solution was chosen, but I doubt it was this one.)

I was told that a few years ago some QCD theorist wanted to test his model against the old PETRA data (since unfolding all the implicit, now-obsolete model assumptions from the original papers proved to be impossible), but this was not possible: the data were stored, but nobody remembered how the format was defined.

3. dorigo - October 30, 2008

Yes, backward compatibility is one of the main issues… The other, IMO, is indeed user-friendliness, as pointed out by Andrea. We base our analysis of data on knowledge which has to be captured and distilled, then stored together with the data. Otherwise we lose part of the value of the data themselves.

This discussion reminds me of a presentation I heard at Moriond in 2005, when some old JADE data were re-analyzed. The speaker gave a very interesting talk! Check the link in this comment.


4. Markk - October 30, 2008

I used to help architect systems to hold design records of X-ray and other diagnostic imaging systems, records that would be kept for decades, as long as any of the systems were in use or any of the images were kept.

The problem, as already shown by the other responses, is usually twofold.

First, a hardware and access-software problem: will you be able to access the data in 30 years? The fact that storage is getting denser is good, as it helps ensure costs won’t rise over time. But one has to make clear to all involved that someday the data will be transferred to another medium, so however it is kept, it shouldn’t be tied to a particular piece of equipment.

One rule we always used was that the technology had to be available from more than one vendor, so that no single vendor’s failure could stop its use. A good example of this is LTO tape: lots of vendors, good stability, high density and open standards. I am sure I will be able to buy or rent an LTO drive 20 or 30 years from now that will read any current tape. Many of the optical storage systems are bad examples, as they came and went. Keeping an actual machine is not a good idea, as we are still in a rapidly advancing era in storage: you or your administrators will want to move the data for cost reasons sometime in the next 30 years. Plan for that right away, not the specifics but the fact that it will happen. Leased disk arrays, with copying as time goes by, could be feasible thinking this way.

Second, and really more important, is the issue of understanding what the data was. Old design data from systems no longer in use doesn’t help a lot. In 20 years there is sometimes no one left who was around and used the tools that created the data. Thus you really need a book, or rather a library, describing what is being saved in enough detail that you could teach a class with it. Also, the data should have as few assumptions as possible built into it: the rawer the better. I think you can assume that whoever wants your data will, at the least, be writing their own software to read and use it; otherwise why would they want it? So keeping hardware to access it is less important than open file formats and clear information on what the data is.

On the other hand, that was usually a big problem for us, and it is where, as system architects, we had to tell the engineers and management to make tough decisions, perhaps giving up the ability to recreate a whole machine from scratch out of the data. In your case I would guess this translates into deciding what level to save at. By not saving at the level of actual instrument traces, the rawest data, you may lose the ability to re-analyze with better techniques, but you may get a data set that is much more accessible to many people. That is where directors and area experts have to make decisions; in our case it was in regard to legal responsibilities. They felt better when we could present the trade-offs as best we could, and that is where there would be work for you guys.

I guess the question is: will the data be usable to test different models or look for currently unknown effects, or will it be too tied to the assumptions made when it was collected and then processed for storage? Was data thrown away in the archiving process, so that a person could no longer “go back” behind certain assumptions? The thing I found was that this was ALWAYS true, so the question was always: what is the right stuff to lose?

It is an interesting problem in its own right.

5. Tito - October 30, 2008

Once I had a chat about storing data from gravitational-wave research in the most compatible, self-descriptive and interchangeable way possible. We agreed that XML could be an interesting choice, though it would probably be quite space-consuming if left uncompressed.
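Such an XML encoding might look like the sketch below; the tag names, attributes and values are entirely invented for illustration, and gzip addresses the space concern. The record stays self-describing even after decades, since any XML parser can recover it:

```python
import gzip
import xml.etree.ElementTree as ET

# Build a hypothetical event record: verbose, but every value is labeled.
event = ET.Element("event", run="152435", id="1001")
for pt, eta in [("24.7", "0.31"), ("11.2", "-1.05")]:
    track = ET.SubElement(event, "track")
    track.set("pt_gev", pt)
    track.set("eta", eta)

raw = ET.tostring(event)        # uncompressed bytes
packed = gzip.compress(raw)     # compression recovers much of the markup overhead

# Decades later: decompress and parse with any standard XML library.
reparsed = ET.fromstring(gzip.decompress(packed))
print(reparsed.get("run"), len(reparsed.findall("track")))
```

The trade-off is exactly the one Tito names: self-description costs bytes, and compression buys most of them back on realistically sized files.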

6. Anonymous - October 30, 2008

Thanks. I agree completely! It’s about time that HEP moved on this issue.

7. estraven - October 30, 2008

The data (…) is owned by the peoples, and I am sorry if I sound a communist here.

Communist, maybe; native English speaker, certainly not :-).

Sorry for the stupid joke: I think you have made a very, very important point, with which I completely agree. As you know, mathematicians never throw away anything – we’re still using Euclid’s old proofs (the correct ones, at least).

8. dorigo - October 30, 2008

Estraven, true, but “data is” is something many of my American colleagues write over and over again. I have been conditioned by them!
As for the lack of “like”, that is a typo of course. As you see, I am quite sensitive to the topic. That is because I make an effort to write correct English, and I always try to improve. So please continue to point out mistakes; I appreciate it.

Tito, XML would not do for particle physics data, I believe. But maybe I am mistaken… It surely depends on how close to “raw” data one wants to get.

Markk, you highlight some of the critical points. What to throw away is in fact the more important question, once one decides to archive things. To give an example, getting rid of calibration data would mean the raw data would be almost useless. But without raw data, many things become untestable. There are also tough choices to make about the “know-how”: if one keeps the one software release considered “stable” and gets rid of the others, some information will be lost anyway, since many published analyses were made with a previous release. These are only examples, but they highlight one point IMO: the issue is given far too little attention, as even the most basic problems are still unsolved.

Cheers all,

9. Alejandro Rivero - October 30, 2008

Emulation is part of the answer. It serves for 40-50 years, while the experts on the old systems are still alive. Note how much Spectrum and VIC-20 software is still running, and that is 30 years old. And there are some 1975 mainframes being emulated by aficionados.
Beyond that, I do not know the answer. Yes, user-friendliness is a point.

10. onymous - October 31, 2008

I look forward to your commentary on what I expect to become a superjets-like fiasco….

11. dorigo - October 31, 2008


We are on two different planets, I gather. As far as the popularization of the information goes, I do not care at all whether the multi-muons turn out to be some nasty background or the discovery of the century. I just rejoice at seeing a difficult analysis coming out, and the rest of the HEP community learning about it, rather than it staying in private archives for years because we are very afraid it could be some detector effect, as happened with the superjets. Have I been clear enough?


