Friday, April 20, 2007

Data crisis

Auugh, perhaps it's time to write about "progress" on the work front as well.

I'm currently pulling through a major data crisis that has set me back a few weeks in my schedule. It really was an '8' on the open-ended Richter scale of thesis crises: for a while I genuinely thought that there was no hope for this experiment, that I'd need to run it again and would probably never get my PhD. And then there was this nauseating feeling that I would lose the cool results I had from a preliminary analysis. I've already published some of those results at a conference, and talked about them to people, and I could see myself having to explain that it was just a "fluke" and not real...

The crisis emerged when I was going through the data from the latest experiment again, to get the final version of the analysis for an article and for my thesis. I was also planning to add some new analyses I've been working on. So I started redoing the earlier analysis, but this time from a slightly different point of view, and looking at different metrics. I had done a "hasty" preliminary analysis for a conference, basically using the non-problematic part of the data and simple measures. Now I wanted to include all the data and go a bit further in the analysis, so that it would be publishable.

That's when I realised that there were a lot more problems in the data than I had noticed earlier. Only two out of seven pairs had performed "perfectly", and another two had problems in about half the trials. If I discarded the problem pairs, I'd have no data to talk about. If I tried to fix the problems, I'd spend the next eternity doing it. And what if, just what if, the effect I saw in the preliminary data was due to the noise, and isn't actually there in the clean data? *shiver*

Well, thanks to the calming company and help of two of my colleagues, lots of deep breathing, and long, slow walks on lunch breaks, I now have some solutions to the problems. Some of them can be sorted out with cleverer data processing. This, I think, I have now worked out. Also, I need solid, objective criteria for excluding "bad" data. These I still don't have. Tonight I'm planning to implement the fixes and see how many dubious trials I have left, then figure out whether to deem them outliers (exclude) or part of the data (include), and also come up with good criteria for doing one or the other. Probably the last one needs to be done first, though. Tomorrow I can run the analyses, and then I'll know whether I will ever graduate or whether it's back to the studio for more experiments.

Of course, conveniently, my supervisor has been out of the country for the whole duration of this crisis... I'm positive I can sort it out by Monday, when he returns.

I think this is why I don't write about my work more on this blog these days... Just too scary! :-)
