Why you should check if the dice are loaded
Written by Andrea Albert
I was searching for evidence of dark matter for my PhD thesis when I was a graduate student at The Ohio State University. While I was still developing my analysis, another team of researchers doing something similar thought there was a chance they had found this evidence of dark matter at the center of the Galaxy.
It turned out not to be a real signal, which was incredibly disappointing. Though we have yet to find evidence for dark matter interactions in the Galaxy, at least I got something almost as valuable to a young particle astrophysicist: a really good lesson in statistical significance and how it's affected by both the data and the models we use to determine it.
For my thesis I looked for a “spectral line” signal in gamma-ray data collected from our Milky Way Galaxy. Such a signal would be very strong evidence for dark matter, since we suspect dark matter particles might interact with each other in a very specific way that produces a pile-up, or bump, of gamma rays somewhere in the spectrum. When my colleague Christoph Weniger (a Senior Research Fellow at GRAPPA Institute in Amsterdam) published a “tentative” detection of such a signal at 130 GeV in the Galactic center at 3.2-sigma, the whole particle astrophysics community, including me, started to get excited.
This could be huge! Finding evidence for dark matter interactions in the Galaxy might be even more exciting than finding the Higgs boson! But to appreciate the sense of excitement, you need to know what "sigma" means, why 3-sigma is special, and why a 3-sigma detection should be viewed with caution.
You may have heard scientists label new findings in terms of sigma: "It's at a 3-sigma level", or even, "It has a significance of 5-sigma", to announce an actual discovery. These terms have strict statistical definitions, which are covered in more detail in this blog post in the context of the Higgs boson discovery. In short, a 3-sigma detection event has a 0.3% probability of occurring by chance, and a 5-sigma event has just a 0.00006% probability of occurring by chance. Physicists traditionally call a 3-sigma detection "evidence", while a 5-sigma detection is considered a "discovery". It may seem like a 3-sigma detection should be good enough, but as any experienced physicist will tell you, many more than 0.3% of 3-sigma detections go away. I thought that sounded crazy too, but after my PhD thesis I learned that there's a lot more going on behind the scenes when calculating a statistical significance than the definition implies. I'd like to explain what "3-sigma" means to me now as an experienced experimental particle astrophysicist.
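For readers who want to check those percentages, here is a minimal sketch of the conversion from sigma to chance probability. It assumes a Gaussian distribution and the two-sided convention (a fluctuation in either direction counts), which is the convention that reproduces the 0.3% and 0.00006% figures quoted above. The function name `two_sided_p_value` is made up for this illustration.

```python
import math

def two_sided_p_value(n_sigma: float) -> float:
    """Probability that a Gaussian variable lands more than
    n_sigma standard deviations from its mean, in either direction."""
    # erfc is the complementary error function from the standard library.
    return math.erfc(n_sigma / math.sqrt(2))

for n_sigma in (3, 5):
    p = two_sided_p_value(n_sigma)
    print(f"{n_sigma}-sigma: p = {p:.2e} ({100 * p:.5f}%)")
```

The 3-sigma value comes out near 0.27%, which rounds to the 0.3% above. Some analyses quote a one-sided probability instead, which is half as large.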
When we are looking for something new, like dark matter, we’re looking for a signal amidst a background (or “noise”). To aid our search, we create models of how we expect the signal and background to look. Then we fit our data to these models to help sort out the signal from the noise. You might already be wondering, “But what happens if something that’s really just part of the background looks like a small signal?” That's what the “sigma” (the statistical significance) tells you! The higher the “-sigma” value, the less likely it is that a signal is just a random fluctuation in the background model.
Remember, in simple terms, a “3-sigma” signal has only a 0.3% chance of being a background fluctuation, and a “5-sigma” signal has just a 0.00006% chance!
Put like that, it sounds like Christoph's “3.2-sigma” detection had a pretty slim chance of being spurious, right? Consider this: the chance of rolling three 6's in a row with a 6-sided die is similar, about 0.5%. If you were walking down the street and found a 6-sided die on the sidewalk and rolled three 6's in a row, would you quit your job and move to Las Vegas to become a professional craps player? To me, throwing three 6's in a row doesn't sound that out of the ordinary.
But what if there was something you didn’t know about the die you just found...
Let's say a friend just saw you roll three 6's in a row and believes that the die is fair–her “model” of the die is that it should roll a 6 exactly ⅙ of the time. If she's right, your chance of rolling three 6's is indeed ⅙ × ⅙ × ⅙ ≈ 0.5%; under that model, it looks like you may have some special talent for rolling the die. However, upon further investigation, you and your friend learn that the die you found is actually a loaded die and you'll always roll a 6. With this new information, the chance of rolling three 6's in a row is 100% in your new “loaded die” model; getting three, ten, or even a hundred 6’s in a row is no longer unexpected.
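The two dice models can be written out in a few lines. This is only a sketch of the arithmetic above; the function name `prob_all_sixes` is invented for the example, and the loaded die is assumed to show a 6 on every roll, as in the story.

```python
from fractions import Fraction

def prob_all_sixes(p_six: Fraction, n_rolls: int) -> Fraction:
    """Chance of rolling a 6 on every one of n_rolls independent rolls,
    given a model in which each roll shows a 6 with probability p_six."""
    return p_six ** n_rolls

fair = prob_all_sixes(Fraction(1, 6), 3)   # friend's "fair die" model
loaded = prob_all_sixes(Fraction(1), 3)    # "loaded die" model: always a 6

print(f"fair die:   {fair} = {float(fair):.4f}")
print(f"loaded die: {loaded} = {float(loaded):.4f}")
```

The same three rolls are a 1-in-216 surprise under one model and no surprise at all under the other. The data did not change, only the model.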
Even physicists sometimes forget that the statistical significance of a signal is only as good as the models. If you get new information, you can improve your model–if you fix a glitch in your experiment, for example, or update a measurement to be more precise. Since the model has changed, you’ll calculate a different statistical significance.
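To make that concrete, here is a toy counting experiment. It is not my actual analysis, and every number in it is invented for illustration: we pretend we counted 130 gamma rays in some energy window, and compare an old background model that predicts 100 with an updated one that predicts 115. The counts are assumed to follow a Poisson distribution, and the p-value is converted to sigma with the same two-sided convention as the 0.3% figure quoted in this post.

```python
import math
from statistics import NormalDist

def poisson_tail(observed: int, expected: float) -> float:
    """P(N >= observed) for a Poisson-distributed count with mean `expected`."""
    term = math.exp(-expected)      # P(N = 0)
    below = 0.0                     # running sum of P(N = k) for k < observed
    for k in range(observed):
        below += term
        term *= expected / (k + 1)  # step from P(N = k) to P(N = k + 1)
    return 1.0 - below

def to_sigma(p_value: float) -> float:
    """Gaussian-equivalent significance, two-sided convention."""
    return NormalDist().inv_cdf(1.0 - p_value / 2.0)

observed = 130                      # hypothetical count in the window
for label, background in (("old model", 100.0), ("updated model", 115.0)):
    p = poisson_tail(observed, background)
    print(f"{label}: expected {background:.0f}, p = {p:.4f}, {to_sigma(p):.1f}-sigma")
```

The same 130 events look like a notable excess under the first model and an unremarkable one under the second. A real analysis fits the full signal and background shapes rather than a single count, but the dependence on the model is the same.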
That's what happened when I finished my analysis and ran it on the 130 GeV line. Unfortunately, the significance I calculated was much lower: only 1.5-sigma. This is because our updated signal model used extra information about the expected signal shape that Christoph didn’t have. This cast doubt on the 130 GeV line being real, and with more data the spurious signal is going away.
A lesson I learned from my PhD thesis is that calculating statistical significance accurately is hard: you need to put a lot of work into understanding your data to get the best signal and background models possible. Christoph knows this too. "To judge how much a particular ‘X-sigma’ claim is worth, [you have to] look at the nitty-gritty details of the analysis," he says. In statistics, an “X-sigma” significance assumes you have a perfect model–and while our models can be very good, they can never describe the data perfectly. As another colleague, Eric Charles (a staff scientist at SLAC), puts it, "the important thing to keep in mind is that data isn't a mathematical abstraction […] There isn't really any perfect way to fold all of [the model imperfections] into one number."
Like many things in life, the answer to “what does ‘3-sigma’ really mean?” is complicated. Of course there is the strict statistical definition, but as any particle physicist would tell you, way more than 0.3% of 3-sigma results go away. However, there are those rare, amazing cases where what starts as a small 3-sigma signal can grow to a 5-sigma discovery, like the detection of the Higgs boson, which had its humble beginnings as a 3-sigma detection.