Episode 138

fMRIs sound pretty scientific, right?

But what if it turns out that some scientific results, backed by fMRI data, may be unreliable?  That’s what Dr. Thomas Nichols, Professor and Head of Neuroimaging Statistics at the University of Warwick, has discovered in his recently published research:  about 10% of the scientific literature that relies on fMRI data is contaminated with false positives.  But how significant is that number, really?  Keep reading (or listening) to find out.

fMRIs and The Brain

fMRI stands for functional Magnetic Resonance Imaging, a neuroimaging procedure that measures brain activity by measuring changes in blood flow and blood volume in the brain.  The basic concept is this:  when neurons in a certain area of the brain are firing, more blood will flow to that area to support the increased activity.  So, more blood flow = more neuronal activity.

It’s an easy and relatively sensitive way of measuring brain activity, which is why it remains popular.  However, fMRI measurements aren’t perfectly precise: they capture a secondary effect (blood flow), not the primary one (neuronal activity).  fMRI data can be easily misinterpreted, particularly if neuronal activity is very brief, since a briefly active neuron may not draw any additional blood to the area.

False Alarms

Dr. Nichols’ work is concerned with controlling false alarms in statistical analysis.  In his own words, the majority of his work has been studying the “really boring case where there’s absolutely nothing going on,” checking that methods fed pure noise raise false alarms no more often than they should.  This time, though, he discovered some major problems with the accuracy of statistical methods for fMRI studies (in this case, task fMRI studies, where researchers look at the brains of people performing tasks).

Now here’s where we get into some statistical jargon.  Deep breath!  Dr. Nichols does a great job of explaining terms like P-value and “noise” in an easy-to-digest way.  Here are the CliffsNotes:

  • Statistical significance:  Quantifies how confident you can be in a result.  If findings are statistically significant, it means it’s unlikely that they arose by chance alone.
  • P-value:  Helps scientists determine if results are significant.  P-values are between 0 and 1 (or 0 and 100%) and tell you how consistent the data are with pure noise.  The lower the number, the less consistent the data are with noise, and the stronger the evidence that something real is going on.
  • 5%:  The commonly accepted P-value threshold for statistical significance.  This means that when there is nothing but noise, a method is allowed to raise a false alarm about 1 out of 20 times.  Keep in mind, a P-value right at the 5% boundary is considered weak evidence.
  • Noise:  A term in statistics for unexplained variation in data, for example measurement error.

Statistical Findings of the Study

In his recently published study, Dr. Nichols estimated that about 1/10th of the literature relying on task fMRIs (about 3,500 fMRI studies in total) may have been affected by inflated false-positive rates.

But!  That doesn’t mean all 3,500 are wrong.  If results have very low P-values (i.e. the data are very unlikely to be noise), the exact P-value reported may be incorrect, but the findings would likely still stand.  On the other hand, if the statistical significance is weak (i.e. right at the 5% P-value threshold), the results might be invalid.

Episode Highlights

0:22 Functional magnetic resonance imaging
2:04 This Week in Neuroscience: Mystery of what sleep does to our brains may finally be solved
4:40 Audience interaction section
6:25 The Smart Drug Smarts Bookshelf
6:53 Introduction to Dr. Thomas Nichols
8:14 What is fMRI?
9:37 Misinterpreted fMRI data
10:09 The blood–brain barrier
11:46 fMRI analysis
16:46 How to know if scientific methods are calibrated correctly
21:42 Voxel-wise versus cluster-wise analysis
27:08 Are there any specific studies that have been called into question?
28:07 False positives versus false negatives
33:35 Ruthless Listener-Retention Gimmick: The Science Behind Slurpees And 'Brain Freeze'

PS:  We don’t need statistical analysis to know it would be an error for you to miss signing up for our weekly Brain Breakfast.

Episode Transcript

-- This Week in Neuroscience --

Jesse: Okay, so this one is kind of a big deal.  It’s a relevant This Week in Neuroscience because it’s based on research that was just published, but it’s also research based on data that was four years in the making looking at a theory about why people sleep.  If physical rest was all that was necessary, then why do we have to go essentially unconscious those eight hours?  So it certainly makes intuitive sense that sleep has something to do with the brain itself, but the question is: what? 

One theory which has been proposed by Giulio Tononi of the University of Wisconsin, Madison, is what’s come to be known as the Housekeeping Theory of Sleep, that our brains are cleaning things up, getting things ready for the next day, that’s what’s taking place while we’re sleeping.  Specifically, that while we’re sleeping, the brain is pruning its connections, actually making synaptic connections less strong, and making room for us to form new memories during the next day.  As he explains it, "Sleep is the price we pay for learning."  And to try to get some physical evidence as to whether this might hold up or not, Tononi’s team looked at the brains of mice before and after a full night’s sleep.  This was a painstaking and time-consuming process, scientists had to collect tiny chunks of brain tissue, they sliced it into ultra-thin sections and then they created 3D models of the brain tissue to identify the synapses (the synapses are the connections between neurons).  In these tiny brain slices, there were nearly 7,000 synapses they were looking at.  They didn’t know until just about a month ago which tissue belonged to which mouse, so they didn’t know whether a particular chunk of tissue was from a not-yet-slept mouse or a mouse who had had its sleep. 

But what resulted from this data collection was that the synapses taken in the sample at the end of the sleep period were 18% smaller than the synapses in the brain samples from before sleep, which does reinforce this idea that the synapses between neurons are being weakened while we sleep.  The findings were presented at the Federation of European Neuroscience Societies meeting in Copenhagen last week, and seemed to be well-received by the other scientists there.  If the Housekeeping Theory was right, it would help explain both why it’s a good idea to get some extra sleep on the night before you’re planning on doing a lot of learning as well as the relatively well-established idea that getting sleep after learning has taken place is an important way of helping to lock in that learning. 

Other evidence for the Housekeeping Theory includes the fact that EEG recordings of the brain show the brain to be less electrically-responsive at the start of the day after a good night’s sleep than it is at the end.  So, the learning and the synaptic strengthening in the prospective new memories that have been formed during the day, at a physical level that’s making the brain more electrically responsive.  Interestingly, the team also discovered that some synapses seemed to be protected.  The largest fifth of the synapses in the samples that they studied stayed the same size both before and after sleep, which seems to indicate that the brain might be preserving its most important memories.  According to Tononi, "Regardless of anything else, you keep what matters."

-- Main Interview --

Jesse: So, the article was entitled "Cluster failure: Why fMRI inferences for spatial extent have inflated false-positive rates," which might not be the kind of article title to just jump off the shelf and grab you, admittedly.  But that was the original paper that was published in the Proceedings of the National Academy of Sciences, which is a big deal, that’s a major academic publication.  But sort of the highlight-reel summary is the idea that the statistical methods that scientists have been using to look at the data that they’ve received from fMRI studies and determine whether changes that they think they see in the brain are really significant changes or could be just noise in the data that’s getting inaccurately assessed to be the results of an experiment, that’s the question that this paper is asking.  FMRI is a technology that’s 25 years old now and has been used in as many as 40,000 different published studies, so you can imagine that if there’s a problem with fMRI or the way that fMRI is evaluated, that could be kind of ripping up the floorboards of a lot of scientific findings.

I wanted to know to what extent this is something that we should be freaking out about, but I figured before going too crazy, let’s just go to the source and talk with one of the authors of this paper.  So, Dr. Thomas Nichols is the head of neuroimaging statistics at the University of Warwick, which actually has an Institute of Digital Healthcare, appropriately enough.  So, get ready to learn a lot about experiment analysis, which is something that we haven’t really talked about so much in the past, and also the nuts and bolts of what is actually being measured when an fMRI is being performed.  It is probably not what you think.  So with no further ado, let’s jump in now with Dr. Thomas Nichols. 


Dr. Nichols: What does fMRI measure?  FMRI is actually messy.  It is not a quantitative method, it is measuring changes in blood flow and blood volume.  So, it’s complicated.  There are more quantitative methods that just measure blood flow, or some very sophisticated methods that measure oxygen metabolism, but we use fMRI because it’s easy and it is relatively sensitive.  All the other methods that are more quantitative and give us more precise information about the physiology require either more time to acquire or specialized equipment or simply have less signal to noise, they’re not as precise.  So, fMRI is as popular as it is because it’s relatively easy to collect the data and it’s relatively sensitive.  But if you get into what exactly is it telling us and also how spatially specific, that’s another issue.  If I see a change in a single 2mm by 2mm by 2mm voxel, does that mean that brain tissue underlying that voxel is showing a change?  Well, actually there is some imprecision there because we’re measuring this effect from blood, and blood can run out or drain from the precise tissue and there’s some imprecision, because we’re not measuring the actual neuronal activity, we’re measuring the secondary measure that’s related to the amount of blood present.  So, it’s complicated, but the bottom line is that it’s telling us something about blood flow, blood volume, which is our clue to what the neurons are doing.

Jesse: I remember reading pretty recently that fMRI can be pretty easily misinterpreted if something takes place very quickly in the brain.  Because a brain cell, if it’s briefly active, it might not burn through all of its glucose supplies and actually need more energy from the blood, so there might not be any change in the blood flow, and that would make whatever the brain cell just did essentially invisible on the fMRI. 

Dr. Nichols: That is what I understand about the BOLD effect.  I’m not a BOLD physiologist, but the BOLD effect is a result of the neurons trying to re-establish their ionic potential and getting back to their stasis.  And so if they haven’t completely exhausted their reserve, that makes sense.

Jesse: I feel like before we get too far into this and start talking about fMRI specifics and statistical analysis, because a lot of this has to do with blood flow and the brain, I want to get a better understanding of the blood-brain barrier.  I’ve got what I feel has got to be a mistaken impression of what this thing is.  The brain is protected from direct exposure to blood because the blood could have toxins or something in it, and yet the brain is not a small organ, there’s no way that blood vessels on the outside of the brain are going to be diffusing oxygen all the way down into the core of the brain.  So, what am I not understanding? 

Dr. Nichols: So, the blood-brain barrier is what coats the entire vascular system in the brain.  So, you may think of it as this layer around a big artery, but it is also a layer that goes around the tiniest capillaries as well.  So, maybe you remember from high school, the movies of the single blood cells wiggling along a capillary—that’s the finest blood vessel, and those capillaries still have a blood-brain barrier around them, but they are so small that oxygen can diffuse out from capillaries into the surrounding tissue. 

Jesse: I was thinking of this as like a bag in my head, just kind of an oval-shaped thing.  But it sounds like the blood-brain barrier, if we were to just look at it as an item unto itself, it would have these millions of little branches descending down into the shape of the brain. 

Dr. Nichols: It is the skin of the blood system in the brain.  There is an area in the brain called the area postrema, which is a little tiny bit of blood vessels that stick out of the blood-brain barrier, and they are there to basically sense what is going on inside the blood, and in particular they can sense poison.  So, there are some things that don’t get across the blood-brain barrier that can really hurt you, and there’s this one little tiny part of the brain that’s got a little sensor, a little finger in the wind to check what’s going on in there. 

Jesse: Cool.  Well now that we’ve laid a little bit of the background there, let’s talk about fMRI analysis, which is where you’re specialized, your recent study and the paper that came out, and some of the hullabaloo that’s come in the wake of that. 

Dr. Nichols: There are two types of fMRI studies: there’s resting state fMRI studies and task fMRI studies.  Now, for the past 25 years, the mainstay has been task fMRI studies, and that’s the subject of our paper, basically trying to evaluate the accuracy of statistical methods for task fMRI.  That’s where we stick someone in a scanner, we show them a list of words, try to remember them, rest, active, rest, active… Trying to find areas of the brain that vary systematically between the rest and active states.  In the last, say, ten years, there’s been an immense excitement about resting state fMRI, what is going on in the brain when we don’t do anything at all.  And what’s also kind of neat is that in that discipline of resting state fMRI, there’s been a real culture of data-sharing, so there are actually vast libraries of shared publicly-available data of this resting state fMRI.  So, I am a statistician, I develop methods for all sorts of brain-imaging data but particularly for task fMRI data, and my research addresses the issue of inference, specifically how do we know what parts of the brain are showing a change.  So, I come up with techniques that basically say, "That change over there, that’s a real change, we can say that is different from noise.  That thing over there, hm, no, not so much, that’s probably consistent with noise."  So, that’s kind of the mainstay of my research, it’s what I’ve done for the past 20 years. 

Now within that world of statistical inference, there are different types of methods and there are methods that are called parametric methods, and they make assumptions—all statistical tests have assumptions, I should say.  There are parametric methods which use assumptions about the data to make their conclusions, and there are nonparametric methods that make some assumptions but much weaker assumptions.  In some special cases, they make almost no assumptions, but let’s just say they make much weaker assumptions.  So in my work, I have developed both parametric and nonparametric methods, but have found that these nonparametric methods have often worked better.  So this paper is one in a long line of papers that I have been working on to compare these two different types of methods. 
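To make the parametric/nonparametric distinction concrete, here is a minimal sketch of one nonparametric approach, a sign-flip permutation test.  The interview does not say which nonparametric method is meant, so this choice, the function name `sign_flip_p_value`, and the sample numbers are all illustrative assumptions.  The only thing it assumes about the data is that pure noise is as likely to be positive as negative.

```python
import random


def sign_flip_p_value(differences, n_permutations=5000, seed=0):
    """Estimate a two-sided P-value for 'the average difference is zero'.

    Assumption (hypothetical setup): each entry is one subject's
    task-minus-rest difference, and under pure noise its sign is arbitrary.
    """
    rng = random.Random(seed)
    observed = abs(sum(differences) / len(differences))
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        # Randomly flip the sign of every difference, as noise would allow.
        flipped = [d if rng.random() < 0.5 else -d for d in differences]
        if abs(sum(flipped) / len(flipped)) >= observed:
            at_least_as_extreme += 1
    # Count the observed data as one arrangement so the estimate is never 0.
    return (1 + at_least_as_extreme) / (1 + n_permutations)


# Made-up numbers: every subject shows a positive difference.
consistent_effect = [1.2, 0.8, 1.5, 0.9, 1.1, 1.3, 0.7, 1.0, 1.4, 0.6]
# Made-up numbers: differences cancel out exactly.
no_effect = [1.0, -1.0, 2.0, -2.0, 0.5, -0.5]

p_effect = sign_flip_p_value(consistent_effect)
p_none = sign_flip_p_value(no_effect)
print(f"consistent effect: p = {p_effect:.4f}")
print(f"no effect:         p = {p_none:.4f}")
```

A parametric test would instead assume a particular shape for the noise (for example, a bell curve) and read the P-value off a formula; the sketch above gets its reference distribution from reshuffling the data itself.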

Jesse: When you say "worked better," that means when you are checking their findings against known data where you actually know what the correct answers are, they come out closer to the truth? 

Dr. Nichols: That’s right.  So, this is where it gets really boring as a statistician because actually, as a statistician, the first thing we worry about is control of false alarms, what we call false positives.  So, the majority of my work has actually been studying the really boring case where there’s absolutely nothing going on.  But I spend my life making sure that the methods that I develop, they’re never perfect, but that when I put noise data into my methods, they only give a detection a certain number of times.  In particular, we usually calibrate them for 5%.  This relates to this magical, you may have heard of P-value threshold, basically saying 1 out of 20 times is okay, we can put up with that; when there’s only noise, it’s okay to say there is something there 1 out of 20 times.  When I say "work," I basically mean that the false positives are controlled at this 5% level. 
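The calibration check described here can be sketched as a small simulation: feed a test nothing but noise many times over and count how often it raises an alarm.  This is only an illustration under stated assumptions, not the procedure used in the paper: the noise is assumed to be Gaussian with known unit variance, and the names `two_sided_p` and `false_alarm_rate` are hypothetical.

```python
import math
import random


def two_sided_p(sample):
    """P-value for 'the mean is zero', assuming unit-variance Gaussian noise."""
    z = sum(sample) / math.sqrt(len(sample))
    # erfc(|z| / sqrt(2)) is the two-sided tail area of a standard normal.
    return math.erfc(abs(z) / math.sqrt(2))


def false_alarm_rate(n_experiments=4000, n_points=30, alpha=0.05, seed=0):
    """Fraction of noise-only experiments that the test calls significant."""
    rng = random.Random(seed)
    alarms = 0
    for _ in range(n_experiments):
        noise = [rng.gauss(0.0, 1.0) for _ in range(n_points)]
        if two_sided_p(noise) < alpha:
            alarms += 1
    return alarms / n_experiments


rate = false_alarm_rate()
print(f"false alarm rate on pure noise: {rate:.3f}")
```

Because the simulated noise matches the test's assumptions exactly, the rate should land near 0.05.  A result well above that would mean the method is not calibrated for that kind of data.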

Jesse: Just sort of as a general point of reference when we talk about things reaching statistical significance, 5% is sort of the commonly accepted value for statistical significance.  If noise might make something show up as having actually happened more than 5% of the time, then we say it’s probably not statistically significant, whereas if the chances that what looks like a finding might just be noise and that happens less than 5% of the time, then we say that we have achieved statistical significance.  Is that borderline correct? 

Dr. Nichols: That’s roughly correct.  There’s been lots of discussion about the correct interpretation of P-values, but the best way I think of describing them is they are talking about how consistent is the data with noise, and if they are small, they’re saying, "Hm, the data is inconsistent with noise."  It doesn’t mean that there truly is an effect or that there truly is no effect, but it’s just saying there’s an inconsistency between the data and this noise model.  It’s a confusing thing because if P-values are not small, you can’t say there is no effect, it’s just saying, "Eh, the data are consistent with noise."  And what people have found is that that 5% rule actually, all things considered, is some pretty weak evidence.  So outside of science, if anyone ever sees a P-value and it’s right on this magical 5% boundary, that’s the weakest evidence.  We’ve agreed just by rule of thumb, yeah, that’s evidence, but things right at the boundary, it’s weak evidence.  We look for smaller P-values, like .01 and .001 to be much stronger evidence.

Jesse: It’s like if you remember back to high school, some classes you got pass/fail instead of A, B, C, D, F, like physical fitness was one of those things.  And things that were ranked only on pass/fail, we looked at them sideways even then. 

Dr. Nichols: My life’s work, in some ways, is not avoiding false positives, because you could never avoid false positives 100%, but to make sure that all of the methods that I develop are precisely calibrated.  And so this work that’s just been published in the Proceedings of the National Academy of Sciences is all about calibration.  We did a massive evaluation to see are all these different techniques calibrated as they should be at this 5% level.

Jesse: Can you kind of unpack what calibration means?  You’re looking at data from past studies, so how do you go in after the fact and sort of do an experimental autopsy to make that determination of whether they were calibrated correctly? 

Dr. Nichols: So, the actual scientific work was calibrating the statistical methods that had been used, then the issue is what is the implications of our findings.  So first of all, there basically was an overstatement in the paper sort of implying that the entirety of the fMRI literature was at risk and implicated in our findings, and that was not correct.  I have done sort of a rough and ready analysis that the worst behavior that we saw perhaps affects about 1/10th of the literature—so it’s not the entire literature, it’s maybe a tenth. 

But what does it mean that it’s been affected?  Well, it means that if a researcher had used this one particular method and they got something just over the borderline of 5%, then maybe that result is actually consistent with noise when they thought it was significant.  Now, you might say, "Well, does that mean the scientific findings of the paper are wrong?"  It’s impossible to say in general because… So, for example, if the statistical finding wasn’t just over the borderline, if the statistical P-value was really strong, maybe if they had used a method that was more accurate, the result would still be significant.  It’s also possible that even if that result did not meet the fixed measures of significance, it would be correct; there would be other papers later on that actually showed the same effect.  And that’s what we use meta-analysis for, to sort of pool over a group of literature.  And then the last possible outcome is that actually, "No, actually it is a true false positive and it was just an unfortunate error."  So that’s the problem, that doing the autopsy, in practice, to do it for a large swath of the literature, is impossible. 

Jesse: Because we’re literally talking about tens of thousands of studies at this point that use fMRI data. 

Dr. Nichols: The number I came up with, and I did an analysis on my blog, it came out with a number of about 3,500. 

Jesse: 3,500 out of the larger total of fMRI studies in general. 

Dr. Nichols: Yeah, and again, Anders and myself are statisticians, methodologists.  We’re not used to writing for the broader science world and certainly I think may have overstated the implications.  It’s just hard because just because a P-value is wrong or maybe it should be slightly larger doesn’t mean that the scientific conclusions of the paper are false. 

Jesse: This might be a difficult thing to explain, but can you explain how the mistakes that were made, how you identified them and why this wasn’t done previously?  Like why this is a finding rather than something that people have been doing the right way all along? 

Dr. Nichols: So, I have, for all of my career, been comparing these two methods, the parametric and the nonparametric.  So one outcome of the paper was actually well-known and not novel at all.  One thing that no one’s really talking about is that this other type of inference is called voxel-wise, and we can get into this difference between the voxel-wise and cluster-wise difference later, but the voxel-wise results are just fine.  If anything, they’re a bit conservative.  And here’s this sort of binary thing in statistics, and it’s kind of weird, but we are very obsessed about knowing that our methods are calibrated so that the false positive rate is no more than 5%.  If it’s less than 5%, that’s not great, that means we don’t really have a very optimal method, but it’s safe.  So for some time, I’ve known that the voxel-wise methods are safe, if a little bit conservative, and that they could be better, in fact, if you used a nonparametric approach, you generally have more sensitive results.  So, that’s old news. 

I have also done evaluations of cluster size, and the problem is that I had previously used numerical simulations.  I basically did computer experiments to evaluate and compare the cluster-wise inference under parametric methods to the cluster-size inference under nonparametric methods, and I could see some stability but I just couldn’t see the pattern, and in some cases it was okay and sometimes it was a little bit conservative, and there wasn’t any pattern to make me worried.  In fact, I had done some large-scale evaluations where I saw the same thing, that with simulations this cluster-size inference was working fine. 

So, the main thing that’s new and the novel contribution of this work is the large-scale evaluation using real data, and it’s only possible because of all of this shared data that’s available where people are doing no task.  It’s like, I couldn’t have designed it better.  If I want to evaluate how task-based statistical methods work when there was no task, resting state data, it’s perfect, and we just didn’t have that at our disposal.  So, that’s the new thing, is actually using the real data to evaluate it, and that’s when we discovered the assumptions that I had been making in my numerical experiments basically match the assumptions of the standard methods, and so things looked like they were okay.  And it was only when we saw the real data, we saw that there was a mismatch between the assumptions of the theory and the real data.  And that mismatch has to do with the spatial dependence of the data, basically how blurry the data is, that the degree of blurriness varies in space, and just the scale, how far out in space that blurriness goes is farther than the theory had assumed. 

Jesse: That sounds like an ideal transition to talking about voxel-wise vs. cluster-wise and what that means.  Voxels, I feel like I’ve got a pretty good idea what that is right off the bat.  Most people are familiar with the concept of a pixel, like your camera is X number of megapixels and that’s just a little teeny tiny square that has some defined properties, like its defined color in the case of a photo.  And a voxel is just a three-dimensional pixel, so you can almost think of them as like little LEGO blocks, a little three-dimensional address for a piece of reality.  And sort of by convention, voxels are always square, right?

Dr. Nichols: Voxels or pixels don’t have to be equal in size, so they can be any shape.  Usually voxels have equal dimensions in two of them, like the X and Y, and then they maybe aren’t equal in the third dimension, or they can be equal.  It would be clear if I talked about a very concrete example and actually take it out of imaging all together.  Think about the problem of mapping crime in a city.  And actually to strengthen this analogy, let’s say we’re talking about a nice midwestern city, so it’s comprised entirely of square blocks.  And we have the crime data down at the block level, and now you want to make the inference, where in the city is there a crime problem?  Now, this is block-by-block data, say it’s over a small period of time, it’s going to be very variable, there’d be a lot of zeros and maybe one or two here that you wouldn’t call a problem, it’s just sort of the typical amount of crime.  So you could make an inference on where there is a crime problem in two ways.  You could do it block-by-block; you basically say I’m going to fix the threshold, let’s say 10 burglaries per year or something, and if there are more than 10, we’re going to say that block has a crime problem.  That’s analogous to voxel-wise or pixel-wise inference.  You’re making a judgement on each individual location, is there a crime problem.  And by that, you can make a binary map that says where is there crime above 10 burglaries a year and where is there not. 

Alternatively, you could use a two-step decision process where you would first use a threshold to define clusters, and then you would gauge the severity of the crime problem by the size of the clusters.  And so here you might threshold at, say, 5 burglaries per year per block and then create clusters, and you’d call two blocks in the same cluster if they shared a side, if they faced each other across the road.  And then you’d sort of figure out where are all the clusters, and then you’d decide does this area have a crime problem or not based on how big the clusters are.  You’d say 9 blocks; bigger than 9 blocks, we’re going to call that a crime problem or not.  So, that’s the distinction between voxel-wise and cluster-wise.  It’s a judgement of the significance based on the intensity at each point individually, and that’s voxel-wise; or in a two-step process where you have to define clusters by some particular threshold and then you make a judgement on the significance based on the size of the clusters. 
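The two decision rules in the crime analogy can be sketched on a toy grid of city blocks.  Everything here is illustrative: the burglary counts are made up, the function names are hypothetical, "threshold at 5" is read as "at least 5" (an assumption), and the demo uses a smaller minimum cluster size than the 9 blocks mentioned above so that a tiny grid can show the effect.

```python
from collections import deque


def blockwise_map(grid, threshold):
    """Voxel-wise analogue: judge every block on its own count."""
    return [[count > threshold for count in row] for row in grid]


def cluster_sizes(grid, defining_threshold):
    """Cluster-wise step 1: group qualifying blocks that share a side."""
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    sizes = []
    for r in range(rows):
        for c in range(cols):
            if seen[r][c] or grid[r][c] < defining_threshold:
                continue
            # Breadth-first flood fill over the four side-sharing neighbours.
            queue = deque([(r, c)])
            seen[r][c] = True
            size = 0
            while queue:
                y, x = queue.popleft()
                size += 1
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx]
                            and grid[ny][nx] >= defining_threshold):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            sizes.append(size)
    return sizes


def significant_clusters(grid, defining_threshold, min_size):
    """Cluster-wise step 2: keep only clusters bigger than min_size blocks."""
    return [s for s in cluster_sizes(grid, defining_threshold) if s > min_size]


# Made-up burglaries per block per year.
city = [
    [0, 1, 0, 0, 12],
    [6, 7, 5, 0, 0],
    [5, 6, 8, 0, 1],
    [0, 5, 6, 0, 0],
]

hot_blocks = sum(flag for row in blockwise_map(city, 10) for flag in row)
print("blocks over 10 burglaries:", hot_blocks)
print("cluster sizes at threshold 5:", cluster_sizes(city, 5))
print("clusters bigger than 3 blocks:", significant_clusters(city, 5, 3))
```

On this grid the block-by-block rule flags only the single block with 12 burglaries, while the cluster rule ignores that isolated block and instead flags the eight-block neighbourhood of moderate counts.  The two rules are sensitive to different kinds of signal.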

Jesse: This is my first time thinking about this so I might be completely wrong, but intuitively it seems like if you adjust the size of the voxels down or up, you could kind of get those two to sort of be functionally equivalent.  But I guess that must not be true. 

Dr. Nichols: I see your intuition, if you kind of blur things out more, you’ll learn about sort of the clustering aspects.  What doesn’t work with that is when you change the size of the voxels and you’re picking a particular center; you basically pick a new center for each of the grids, and that may not be the optimal center to detect something.  And also it may be that the effect that you’re looking for has a very irregular shape that wouldn’t be represented well by just a bigger voxel.  So, that’s why in our paper there’s this important issue of the cluster-defining threshold.  What is this threshold that we’re going to use to define these blobs?  And it turns out that it is arbitrary, you can set whatever you like, but there are two of these values that are mainly used in the literature, and so we evaluated the performance of these statistical methods at these two commonly used thresholds. 

Jesse: And did one come out fundamentally better than the other?  Are they ever used in conjunction, where people run an analysis using both of them and see where they agree and where they don’t? 

Dr. Nichols: That’s a good question.  It is discouraged to try looking at different ones because it then amounts to a fishing exercise. 

Jesse: Sort of cherry picking your data? 

Dr. Nichols: Yeah.  And it is tough, because some people say, "Well, was that the right threshold?  Maybe I should try a little bit lower, try a little bit higher."  For any one threshold, I can guarantee that the statistical method will deliver the calibrated 5% false positive rate you ask for.  But if you try a bunch of different ones and take the best result, I can guarantee you that calibration is lost and that your false positive rate will be greater.  So, people always have to commit to one of these and run with it. 
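A back-of-the-envelope sketch of why shopping around for a threshold breaks calibration.  It makes a simplifying assumption that is not true of real analyses: that each look at the data is statistically independent.  Different thresholds applied to the same data are correlated, so the exact figures would differ, but the direction of the problem is the same.  The function name is hypothetical.

```python
def chance_of_any_false_alarm(alpha, n_looks):
    """Chance that at least one of n independent looks at noise says 'significant'."""
    return 1 - (1 - alpha) ** n_looks


for looks in (1, 2, 3, 5):
    rate = chance_of_any_false_alarm(0.05, looks)
    print(f"{looks} look(s): {rate:.1%} chance of a false alarm")
```

Under that independence assumption, a single pre-chosen analysis stays at 5%, while taking the best of three looks already pushes the chance of a false alarm to roughly 14%.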

Jesse: Is that something that scientists, like when you’re saying I’m going to do a study and putting it before your review board to make sure your methods are correct and all that, is the analysis something that they need to declare upfront in case somebody has bad scientific ethics, so they can have their hand slapped later? 

Dr. Nichols: That is a great question, and in clinical trials that are registered with the government you indeed have to say exactly what your statistical analysis is going to be.  But it’s only in that domain.  However, there are a number of open science initiatives promoting this idea of pre-registration.  So, there are a number of psychologists and other scientists who are saying not just clinical trials but other areas of science would benefit from investigators committing to the exact analysis plan that they’re going to put their data to.  But, it is very, very rare still.  So, we looked at these two thresholds and our cluster-defining thresholds that we looked at were .01 and .001, so 1% or 0.1%, and we found that the 1% threshold had inflated false positive rates.  Again, the methods should be calibrated for exactly 5% false positives and they were coming out, depending on the software, the smoothness, much above that.  The worst was 70%; they ranged between 30% and 70%.

Jesse: Are you aware of any particularly big or prominent studies that have been directly called into question as a result of your findings?  Ones that the data maybe we once thought were significant but we should look again at? 

Dr. Nichols: Most scientists are savvy enough to know you don’t want to go to town on any one single map, and so you try to come up with lots of evidence.  So the most important studies, and I would say most studies that get into the high-profile journals, will have run a replication.  They’ll do it again slightly differently, or they’ll use a different set of subjects.  And so even if there’s something fragile in their method, even if false positive rates are not controlled exactly, remember that false positives are going to occur anyway in the brain, and so they will not replicate in a second study.  So anyway, I don’t know of any examples offhand because I think the studies that are the most controversial, the most exciting, the most provocative, those investigators have to go to extreme lengths to make the cases that they’re making, and if there were random false positives, they just wouldn’t replicate as they were making the case. 

Jesse: That’s good, that sounds like science doing its job.  This doesn’t necessarily deal with your studies in particular, but why the emphasis on false positives vs. false negatives?  And just for background reference for people, a false positive is when you get a test to see if you have cancer and they say, yes, you do have cancer but you don’t actually have cancer, that would be a false positive.  A false negative is if you get a test to see if you have cancer, they say, no, you don’t have cancer, but actually you do. 

Dr. Nichols: This is a great question, and some people have criticized brain imaging for focusing exclusively on false positives.  And indeed, I admitted that my research was largely based on establishing the false positive rate of the methods I work on.  In some settings, and especially in presurgical planning where you’re trying to figure out which brain tissue is actually involved in speaking or moving and you want to make sure you don’t cut that out when you go to remove a tumor, you are very much concerned with false negatives.  Because if you use a method that only worries about strong false positive control, you might miss some regions and then the surgeon would conclude, "Oh, I can sacrifice that brain tissue" when in fact you really need it.  So, it’s a different area of methodology.  It’s not classical statistics but there are some other areas of research that have been working to make inference methods that more equally balance false positives and false negatives, and that application to presurgical planning is a particularly good one because that’s a case where it really, really matters.  However, there have been psychologists who also argue that people like me, my obsession with strict false positive control, is misguided and that we should allow methods that don’t strictly control false positives and then let meta-analyses and multiple studies replicate or not as an approach, motivated by the fact that when we have strict control of false positives, we indeed are likely missing some things. 
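
The trade-off described here can be put into numbers with a minimal sketch. It assumes a single test statistic that follows a standard normal distribution when nothing is going on, and a normal distribution shifted by a hypothetical effect size of 2.0 when the tissue really is active. The effect size, the threshold values, and the function name are illustrative assumptions, not figures from the study.

```python
from statistics import NormalDist

def error_rates(threshold, effect=2.0):
    """Return (false positive rate, false negative rate) for one threshold.

    Assumes the statistic is N(0, 1) with no activity and N(effect, 1)
    with real activity; 'effect' is a made-up illustrative value.
    """
    null = NormalDist(0.0, 1.0)
    active = NormalDist(effect, 1.0)
    false_positive = 1.0 - null.cdf(threshold)  # noise lands above the threshold
    false_negative = active.cdf(threshold)      # real activity lands below it
    return false_positive, false_negative

for threshold in (1.0, 1.645, 2.5, 3.1):
    fp, fn = error_rates(threshold)
    print(f"threshold {threshold:>5}: false positives {fp:.3f}, false negatives {fn:.3f}")
```

Raising the threshold pushes false positives down and false negatives up. That is why a presurgical map, where a missed region is the costly mistake, may call for a different balance than a research study.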

I would say that the major takeaways from my study are the need for sharing data, frankly.  There is this strange situation that in brain imaging, when we report a study, all we have is a picture, a rendered brain with some colored blobs that show where the activation is, and typically a table of coordinates that show where the peaks are, the most intense changes in the brain.  The actual statistic image, this 3D image of the statistical effect, is generally not shared, and this is in stark contrast to, say, bioinformatics or genetics, where if you do a gene expression study, you can’t publish that if you don’t include a table, usually in the supplemental data, of what the statistical result was in each of the genes.  Or if it’s a genome-wide association study, of which there are many, again, it will have a table in the supplementary data that says what was the statistical result at each and every SNP, single nucleotide polymorphism.  And for whatever reasons, there’s never been a tradition of that in brain imaging, and if there were, we could have gone back and looked at each of these studies and actually seen, "Well, if they’re making some guesses about the individual studies, what do we think that the right threshold should have been?" 

So, that’s something that has to change; I’m involved in the professional organizations in my area to try to get journals to change those standards.  Unfortunately, it’s a chicken and egg thing.  No journal wants to be the first one to tell the scientists some extra thing they need to do.  But at the same time, the journals know that this is good practice, so they want the leadership of an organization to sort of help make that happen, and I think we’re close to making that happen.  And then second to that is actually sharing everything, sharing the source data.  If we had all the data for all the previous studies, we could go back and reanalyze them with methods that have actually been much improved since those first studies were done.  That’s a bigger ask; that’s a really big pain, to get all your data organized and get it uploaded and sent somewhere, it’s a very different issue from just uploading a statistic map.  But we need to get there as well, because that’s, again, in the spirit of open science, and I think the funders are coming around, that they’ve been paying for all this data acquisition and then that data just disappears and never gets seen again.  So, I think that’s going to change soon, as well. 


Jesse: So, thank you so very much to Dr. Thomas Nichols for taking the time for that conversation.  I realize that that wasn’t the most actionable episode as far as something that you can walk away from this conversation and make changes in your own life, buy a new supplement, do a new practice.  But I really, really actually kind of like letting ourselves go full egghead every now and then and look under the hood a bit at what scientists are doing in their natural habitat, where some of these findings are coming from, why we might take them seriously. 

I think the really great thing about science is that it’s constantly endeavoring to be self-critical, that no finding is ever secure, nothing is sacred, and it’s incumbent on any scientist to always try to tear up the cornerstones, if possible.  Because even though on the one hand it would be great if the things that we think are accurate are, in fact, accurate, it would be even better to find out, oh, we were mistaken and maybe we have to tear up the floorboard a bit, but at least then we’re not going down the wrong road further.  I recently read a book about the founding of the Royal Society.  This is an early group of naturalists (naturalists are what scientists used to call themselves before the term "scientist" came into vogue); it started 500 or 600 years ago in the UK.  But their slogan was and is "Nullius in verba," which is Latin for literally "On the word of no one," but to translate that into modern English is "Don’t take anybody’s word for it," or sort of the scientific version of "Put up or shut up" when it comes to data. 

So, I think that the work that Dr. Nichols and his colleagues are doing is really fantastic, that it helps give other scientists the tools to be increasingly self-critical, productively skeptical of one another’s results, and recognizing that even though we can’t know anything for absolutely sure, we can help ourselves get a lot closer to being pretty darn sure about things.  And pretty darn sure can be awfully darn effective, as we see every day in the world around us with rocket ships and iPhones and all that stuff.  But if you’re listening to this podcast, you probably don’t need to hear me waving the science pom-poms, so let’s move ahead now to slurpee overdoses and brain freezes, ice cream, all that and more in the Ruthless Listener-Retention Gimmick. 


-- Ruthless Listener-Retention Gimmick --

Jesse: So on the off chance you’ve been living in the wilds of Antarctica and you’ve never had a brain freeze, what a brain freeze is is this: you drink too much of an icy cold beverage, oftentimes a slushy one, like a 7-Eleven slurpee is kind of the classic example, that you can drink fast, it’s really icy cold, and suddenly you get a very fast onset headache that just hurts.  It’s sharp, you can’t really do much to get rid of it; kind of like the pain from eating wasabi, it’s kind of there, it’s really intense for a few moments, and then it kind of just fades away.  There and then it’s gone, but it sucks while you have it.  And if you go on YouTube, you can find tons of videos of people who give their cats brain freezes on purpose and then videotape the cat, which seems like a pretty awful thing to do, except the videos are pretty funny. 

So, what a brain freeze is, the technical name is sphenopalatine ganglioneuralgia, and this all has to do with the blood vessels at the back of your throat, because those blood vessels in particular are the ones that, when they get hit by a really cold substance—too much ice cream, too much milkshake, whatever—they start sending pain signals.  So what happens when you drink an icy cold beverage is the blood vessels near the back of your throat, at first they rapidly constrict, they get tighter against the coldness, trying to reduce their surface area so whatever it is that’s cold that’s getting at them has less surface area to work with in order to draw out heat.  Then when the cold seems to have passed, once you’ve swallowed the big glob of ice cream or whatever it is, the blood vessels begin to dilate as they get warmer.  Now, the dilation of these blood vessels triggers something called the trigeminal nerve, and that nerve is connected to various regions of your face, including your forehead.  So something that’s happening in the back of your throat is causing pain connections all the way up to your forehead.  Now, nerves cannot always distinguish the original location of a pain trigger.  That’s why if you’re a guy and you’ve ever been kicked in the balls, you can feel that pain in pretty much every part of your body, but specifically up in your stomach and well beyond your groin region, it just hurts all over the place. 

So, back to our big fancy terminology, sphenopalatine ganglioneuralgia, a ganglion is just a bundle of nerves and neuralgia is another word for nerve pain.  So, ganglioneuralgia is saying a bunch of nerves that hurt all at once.  And sphenopalatine is apparently the location of this ganglion, somewhere at the back of the throat.  So #1, just memorize that term, sphenopalatine ganglioneuralgia, and impress your friends, amaze your enemies.  But #2, if you’re hell-bent on drinking slurpees, there are some things to do to make the brain freeze less likely to happen.  The first is using your tongue or even your thumb, something that’s going to be warm, to press against that area at the top-back of your throat to kind of protect it from the cold so it doesn’t have this fast and excessive temperature change.  You could also drink warm liquids right after the slurpee, kind of like a hot water chaser, to keep the temperature change from feeling as extreme.  And finally, if this does not upset your table manners, you can kind of hold the cold liquid in the front part of your mouth, swish it around there, warm it up a little, and again, remove that temperature differential before you let the liquid hit that sensitive area in the top-back of your throat where those sensitive blood vessels live. 

Is there actually any physiological danger that is associated with a brain freeze?  Not as far as anybody knows.  This would be a hard study to get people to volunteer for, because nobody really likes a brain freeze.  The greatest misnomer about the brain freeze, of course, is that it’s not actually your brain that’s freezing or it’s not even your brain that’s hurting.  As you probably know if you’ve watched any of the Hannibal Lecter movies, people don’t actually have nerve endings on the inside of their brain, so neuroscientists have been able to do really weird things, like poke a person’s brain with a finger when they happen to have their head cut open for one reason or another.  The person can’t actually feel the touch, but sometimes the physical stimulation of that brain region can actually cause sensations or feelings or really anything that a brain could be involved with.  But this does lend one interesting possibility to something where brain freezes might actually come in handy.  There are still a lot of forms of headaches where people aren’t sure what the mechanism of action is, and just like a brain freeze can feel like it’s actually something inside your head that hurts, it’s got to be that those headaches are doing something similar; it’s not your brain per se that is hurting.  So further study into the science behind brain freezes could theoretically offer insights into things like cluster headaches and migraines, which are a problem that science has not really adequately solved yet. 

Written by Hannah Sabih
Hannah believes there's nothing 8 hours of sleep and some kale can't cure (yes, she's from California). She's an avid runner, reader, and traveler, who brings you the latest and greatest in neuroscience via our social media channels.