This post will take a trip through the basic steps involved in a task-based functional magnetic resonance imaging (fMRI) experiment (excluding the pre-processing steps, which are material for a future post I'm sure). It is aimed at folks with some knowledge of math/statistics, but generally I hope it will be accessible to anyone with an interest in how to design, analyze, and interpret such an experiment. Ignore the equations if you prefer, and focus on the interactive stuff!
The hemodynamic response function
fMRI is a method that allows researchers to indirectly observe whole-brain neural activity in a noninvasive way. It typically uses an echo planar imaging (EPI) protocol to obtain a blood oxygen level dependent (BOLD) contrast over sequential acquisitions, spaced at regular intervals known as the repetition time (TR). The BOLD signal is driven by the relative levels of oxy- and deoxyhemoglobin (which have different magnetic properties) in cerebral blood vessels. Through a mechanism called neurovascular coupling, neural activity results in an increase in cerebral blood flow (and volume), which increases BOLD intensity due to an influx of oxyhemoglobin-rich blood.
Our first goal in the analysis of fMRI data is to define a transfer function between the BOLD signal and the underlying neural activity that generated it. The change in BOLD signal in response to neural activity can be modelled as a "hemodynamic response function" (HRF). Three such functions are shown below:
These curves represent a hemodynamic response to a very brief pulse of neural activity. At left, I have plotted a gamma function [1]. In the middle is a mixture of 3 gamma functions which together model the pre- and post-undershoots that are observed in the hemodynamic response. At right is a so-called "balloon" model [2,3], which simulates the elastic properties of blood vessels and the smooth muscles that control this elasticity (think of the vessels as a balloon being inflated or deflated).
The idea is that the neural activity we want to estimate is convolved (multiplied and summed while shifting over time) with one of the functions above to produce the BOLD signal we observe with fMRI. These functions are referred to as basis functions. What the above plots show is that our simple gamma function, with one free parameter [4], does a fairly good job of capturing the main form of the empirical hemodynamic response (in particular, its dominant peak), compared to the triple gamma function (with 3 free parameters) or the balloon model (with 6 free parameters). However, it does miss the pre- and post-undershoots.
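If you'd like to see the convolution step in code, here's a minimal sketch in Python (NumPy/SciPy). The single-gamma shape and scale values are illustrative choices on my part, not the exact parameters behind the figures above:

```python
import numpy as np
from scipy.stats import gamma

dt = 0.1                            # time step (s)
t = np.arange(0, 30, dt)            # 30 s of HRF support
hrf = gamma.pdf(t, a=6, scale=1.0)  # single-gamma HRF, peaking near 5 s
hrf /= hrf.sum()                    # normalize so response amplitude is preserved

# a very brief pulse of neural activity at t = 2 s
neural = np.zeros_like(t)
neural[int(2 / dt)] = 1.0

# the predicted BOLD response is the convolution of activity and HRF
bold = np.convolve(neural, hrf)[:len(t)]
```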
In practice, the triple gamma HRF is commonly used for statistical analysis of fMRI data, and is often referred to as the "canonical" HRF [5]. To get a feel for how this function is related to the underlying gamma curves, you can play with the sliders in the figure below:
The \(\beta_i\) slider controls the amplitude of each gamma curve (left plot). The black curve in the right plot shows the canonical HRF, and the green curve shows the sum of gamma functions produced by the \(\beta\) coefficients — your job is to play with these coefficients until the difference between the black and green curves is minimal.
First-level inference on trial-based designs
A trial-based design essentially chops up the experimental session into discrete epochs, or trials, and delivers stimuli the same way in each trial. Stimuli can be manipulated in such a way that a contrast can be performed between different trial types, meant to elucidate a phenomenon of interest.
A simple example of such an experiment might be to display (in random order) images of houses (trial type 1) or faces (trial type 2) for 100 ms, while acquiring a dynamic EPI sequence with a TR of 1.0 s. Our research question could be: for each voxel, is there a difference in neural activity for the presentation of houses, relative to faces?
The time series for this stimulus presentation (a series of "delta" pulses separated by an intertrial interval of 15 s) are shown below in red (for house trials) and blue (for faces). When we convolve these stimulus time series with a canonical HRF, we get the new time series shown in the same colours in the middle. At bottom, I am showing a simulated BOLD time series (light orange) generated by summing these convolved time series (0.5 x Houses + 1.0 x Faces) and adding noise.
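Here's a hedged sketch of that simulation in Python. I'm alternating house and face trials for brevity (the actual design randomizes their order), and the noise level is my own assumption:

```python
import numpy as np
from scipy.stats import gamma

TR = 1.0                # repetition time (s)
n_scans = 300           # number of EPI acquisitions

# single-gamma HRF sampled at the TR
hrf = gamma.pdf(np.arange(0, 30, TR), a=6, scale=1.0)
hrf /= hrf.sum()

# delta pulses with a 15 s intertrial interval, alternating trial types
houses = np.zeros(n_scans)
faces = np.zeros(n_scans)
houses[::30] = 1.0      # a house trial every 30 s...
faces[15::30] = 1.0     # ...with a face trial 15 s after each one

# convolve each stimulus train with the HRF
x1 = np.convolve(houses, hrf)[:n_scans]
x2 = np.convolve(faces, hrf)[:n_scans]

# simulated BOLD signal: 0.5 x Houses + 1.0 x Faces + noise
rng = np.random.default_rng(42)
y = 0.5 * x1 + 1.0 * x2 + rng.normal(0.0, 0.1, n_scans)
```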
For each voxel in the grey matter of the 4D fMRI image, we have a time series \(y(t)\), representing the BOLD contrast at each sequential EPI acquisition. Combining this with our convolved stimulus time series \(x_1(t)\) (houses) and \(x_2(t)\) (faces), across all trials in the experiment, allows us to define the inference problem as a linear model, of the form:
$$y(t) = \beta_0 + \beta_1 \cdot x_1(t) + \beta_2 \cdot x_2(t) + \epsilon$$
where \(\epsilon\) is an error term. This research design can also be represented in matrix form, with each regressor's time points forming a column of a design matrix \(\mathbf{X}\). The intercept term (corresponding to \(\beta_0\)) is also a column of this matrix, with a value of 1 at every time point. In this matrix form, our linear model looks like:
$$\mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\epsilon}$$
This model can be fitted using ordinary least squares (OLS; see also this blog post) [6]. The fit to our simulated data is shown below; light blue is our simulated signal (\(y\)) and dark red is our prediction (\(\hat{y}\)). The model error (\(\epsilon = y-\hat{y}\)), shown in purple, almost perfectly reproduces the added noise (orange).
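Continuing from the simulation sketch above, the OLS fit takes only a few lines of NumPy (`np.linalg.lstsq` solves the least-squares problem directly):

```python
import numpy as np

# design matrix: intercept, houses, faces (N x 3)
X = np.column_stack([np.ones(n_scans), x1, x2])

# OLS estimate: beta_hat = argmin ||y - X @ beta||^2
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

y_hat = X @ beta_hat        # prediction
resid = y - y_hat           # model error (epsilon)
```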
Once we have our fitted \(\beta\) coefficients, called \(\boldsymbol{\hat{\beta}}\), our next task is to test some hypotheses. Our null hypothesis for the original research question (is activity in this voxel different for images of houses than for faces?) is that the \(\beta\) coefficients for the two conditions are equal:
$$H_0: \beta_1 - \beta_2 = 0$$ $$H_a: \beta_1 - \beta_2 \neq 0$$
We can test this by designing a contrast (row) vector \(\textbf{C}\), with an entry for each coefficient, that tests the difference against 0. It looks like this:
\[ \textbf{C} = \begin{bmatrix} 0 & 1 & -1 \end{bmatrix} \] \[ \boldsymbol{\hat{\beta}} = \begin{bmatrix} \hat{\beta}_0 & \hat{\beta}_1 & \hat{\beta}_2 \end{bmatrix}^T \] \[ \textbf{C} \boldsymbol{\hat{\beta}} = \hat{\beta}_1 - \hat{\beta}_2 \]
Under the null hypothesis, the estimated difference \(\hat{\beta}_1 - \hat{\beta}_2\), scaled by its standard error, has a t distribution centered on 0. Our statistical inference can then be performed as a two-tailed one-sample t test [7]. The test statistic is derived as:
$$t = \frac{\hat{\beta}_1 - \hat{\beta}_2}{\sqrt{\hat{\sigma}^2 \textbf{C} (X^T X)^{-1} \textbf{C}^T}}$$
That equation's a bit of a beast. Don't worry if you can't figure out where it comes from (or check out reference 7 if you do want to dig deeper). The tl;dr for this is that the denominator is called the standard error of the contrast, with \(\hat{\sigma}^2\) being the same as the residual mean squared error \(MS_R\) term introduced here. After computing this t statistic, we can obtain a p value from the t distribution with \(df\) degrees of freedom (which is proportional to the number of trials / time points) [8].
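In code, picking up from the OLS sketch above, the whole inference looks like this (I'm using the naive \(N - p\) degrees of freedom here; see [8] for the effective-df caveat):

```python
import numpy as np
from scipy import stats

C = np.array([0.0, 1.0, -1.0])             # contrast: houses - faces
df = n_scans - X.shape[1]                  # naive degrees of freedom (N - p)
sigma2 = resid @ resid / df                # residual mean squared error
se = np.sqrt(sigma2 * C @ np.linalg.inv(X.T @ X) @ C)  # SE of the contrast
t_stat = C @ beta_hat / se
p_val = 2 * stats.t.sf(abs(t_stat), df)    # two-tailed p value
```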
Our p value then lets us make a statistical inference about whether the neural activity in voxel \(j\) differs for faces and houses, given some pre-determined threshold \(\alpha\) (if \(p<\alpha\), we can reject \(H_0\)). The sign of the t statistic gives us the direction of this difference: if positive, houses > faces; and if negative, faces > houses.
It may be more intuitive to think about this graphically. Below, I've plotted the t values as an image plot, with the \(\hat{\beta}_1\) and \(\hat{\beta}_2\) parameters as axes. You can use the sliders to see the effect of changing the (effective) degrees of freedom and the \(\alpha\) threshold, on the resulting t statistics and significance (insignificant values are ghosted). The diagonal is where \(\hat{\beta}_1 = \hat{\beta}_2\).
Second-level inference
The "first-level" inference above is performed on time series for a single participant. Our inference, then, is that we can generalize from our trials to all similar conditions (viewing of faces or houses) within that individual. In other words, our "population" here is comprised of stimulus presentation events. Typically, however, we want to generalize about populations of people. In fMRI studies, this is referred to as a "second-level" inference.
We can treat a second-level inference as a mixed-effect model. This is a notoriously tricky concept, but I'll take a stab at an intuitive explanation. A mixed model contains one or more fixed effects and one or more random effects.
A fixed effect is a parameter that has a constant value, representing the full population; for a normally distributed variable, the population mean \(\mu\) represents a fixed effect. In our first-level analysis above, the \(\beta\) parameters are also fixed effects (within each participant).
A random effect is a parameter whose value is not fixed, but is sampled from a distribution. An effect can be considered random when it is estimated separately for each individual (common for longitudinal or repeated measures designs) or for dependent groups or clusters of individuals (such as students in the same school or district), and is expected to vary across individuals or groups.
In our case, we are trying to estimate a fixed effect (differential neural activity for houses versus faces in the population), but also a random effect (variability in this activation difference across individuals).
Let's say we collect fMRI data as outlined above, with \(N_t=300\) time points and \(N_s=30\) randomly sampled participants. Our data for 10 of these participants is plotted below, where each data point represents the maximal BOLD intensity for individual House trials (x-axis) versus Face trials (y-axis), and each color represents a different participant.
From this figure, it is clear that most (all except two) of the participants show a Faces > Houses effect in this voxel, indicated by their falling above the diagonal. We can also see that the data points for each participant are clustered; their means and standard deviations are shown on the right. This indicates that this voxel is more active when viewing faces than houses, for most individuals in our sample, but that there is variability both between and within individuals.
So how do we deal with this variability? One approach is to simply treat the individual participant means as data points in a new analysis — which works, but effectively throws away what we know about intra-participant variability. This essentially means we will have a poorer fit and a poorer generalization to the population [9].
Another option is to model the individual effects as random effects, with their own coefficients, alongside the fixed population effect. Our resulting mixed-effects model looks like this:
$$\mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \mathbf{Z} \boldsymbol{\gamma} + \boldsymbol{\epsilon}$$
Our fixed effect here is still \(\boldsymbol{\beta}\), while our \(\mathbf{X}\) matrix is obtained by concatenating all the time series for all participants (so it is of size \(N \times 3\), where \(N = N_t \cdot N_s\)), with 3 columns for intercept, houses, and faces, as for the 1st-level analysis.
Our random effect is \(\boldsymbol{\gamma}\), a \(2 N_s \times 1\) vector with two parameters for each participant, while \(\mathbf{Z}\) is a matrix of size \(N \times 2 N_s\), with two columns for each participant. \(\mathbf{Z}\) is basically a large, sparse, block diagonal matrix, with the individual effects for each participant staggered along the diagonal.
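To see what \(\mathbf{Z}\) looks like, here's a sketch that builds it with scipy's `block_diag`; the per-participant regressors are random placeholders standing in for the real convolved house/face columns:

```python
import numpy as np
from scipy.linalg import block_diag

N_t, N_s = 300, 30
rng = np.random.default_rng(0)

# each participant contributes an (N_t x 2) block, one column per trial type;
# random placeholders stand in for the real convolved regressors
blocks = [rng.standard_normal((N_t, 2)) for _ in range(N_s)]

Z = block_diag(*blocks)     # shape (N_t * N_s, 2 * N_s), mostly zeros
```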
It is probably more intuitive to see these parameters in action. Below, I've plotted concatenated BOLD time series for two participants, shown as the black line. House trials are shown as blue shaded areas and face trials as orange. You can play with 7 parameters: three fixed (population-level) \(\beta\) parameters, and two random (participant-level) \(\gamma\) parameters for each participant. These correspond to the slopes for houses (\(\gamma_1\) and \(\gamma_3\)) and faces (\(\gamma_2\) and \(\gamma_4\)):
The \(\gamma\) parameters all start out at zero. This case is a pure fixed effects model; try to fit the red (predicted) curve to the black (observed) curve by adjusting only the \(\beta\) coefficients. Next try adjusting the \(\gamma\) parameters to fine-tune the fit. The lines will adjust separately for each participant. The mean squared error (MSE) for the current model is shown at the top — see how low you can make this! (0.0032 is my personal best).
The size of the random design matrix \(\mathbf{Z}\) can get out of hand quickly, and fitting the above model can become computationally expensive. Luckily, in practice, this model can be approximated directly from the 1st-level statistics we computed earlier. To do this, we take the 30 \(\textbf{C} \hat{\boldsymbol{\beta}}\) contrasts we estimated (one per participant) and collect them into an \(N_s \times 1\) vector, called \(\hat{\boldsymbol{\beta}}_c\). This becomes our outcome variable, and we fit a new model of the form:
$$\hat{\boldsymbol{\beta}}_c = \mathbf{X}_g \boldsymbol{\beta}_g + \boldsymbol{\epsilon}_g$$
The \(\boldsymbol{\beta}_g\) here are the population-level coefficients we are trying to estimate, and \(\mathbf{X}_g\) is the population-level design matrix. In theory, we could use this design matrix to test for effects such as, "Does the house/face effect vary as a function of age?" or, "Does the effect differ for patients vs. non-patients?" However, our goal here is simpler: "Is there a house/face effect at the population level?"
To accomplish this goal, we can set \(\mathbf{X}_g\) to a vector of ones, meaning the \(\boldsymbol{\beta}_g\) we estimate is a single value, representing the mean activation difference (houses-faces) for all 30 participants. Our statistical inference is then: is this mean difference sufficiently large that we would not expect to find it under the null hypothesis (of \(\beta_g=0\)), given our sample size?
As before, this inference takes the form of a t statistic:
$$t=\frac{\hat{\beta}_g}{\sqrt{(\sigma_w^2 + \sigma_s^2)/N_s}}$$
where \(\sigma_w^2\) and \(\sigma_s^2\) are the variances within and between participants, respectively [10].
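In code, this "summary statistics" approach boils down to a one-sample t test on the per-participant contrast estimates. The values below are simulated stand-ins for the first-level \(\textbf{C} \hat{\boldsymbol{\beta}}\) estimates:

```python
import numpy as np
from scipy import stats

N_s = 30
rng = np.random.default_rng(1)

# simulated stand-ins for the per-participant (houses - faces) contrasts
beta_c = rng.normal(loc=-0.5, scale=0.4, size=N_s)

# one-sample t test of the mean contrast against 0
t_stat, p_val = stats.ttest_1samp(beta_c, popmean=0.0)
```

Note that the sample variance of `beta_c` implicitly pools the within- and between-participant variance terms from the formula above.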
Dealing with multiple comparisons
We perform the above hypothesis test once for each grey matter voxel. This is called a mass univariate analysis, with the "mass" referring to the high number of voxels (typically in the \(10^5\) ballpark). Because we use \(\alpha\) to set a fixed limit on the false positive rate, we expect that the number of voxels for which we reject \(H_0\) will be \(\alpha N_v\) (where \(N_v\) is the number of voxels), even when our data are completely random. This leaves us with thousands of false positives.
There are a number of ways to deal with this accumulated (Type I) error. One way is to control the familywise error rate (FWER); or, in other words, to control the probability of at least one false positive. This is determined as:
$$FWER = 1 - (1-\alpha)^{N_v}$$
which, for \(10^5\) voxels and \(\alpha=0.05\), is \(1 - (0.95)^{10^5}\). Or 100%, basically. Correcting for this using any of the usual methods (Bonferroni, Tukey, etc.) is far too conservative (i.e., it massively inflates Type II error).
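Plugging in the numbers makes the problem obvious; a quick sketch:

```python
alpha = 0.05
N_v = 10**5

fwer = 1 - (1 - alpha) ** N_v    # probability of at least one false positive
bonferroni = alpha / N_v         # per-voxel threshold after Bonferroni correction

print(fwer)        # 1.0 (to within floating point): a false positive is certain
print(bonferroni)  # 5e-07: an extremely conservative per-voxel threshold
```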
A second, more reasonable approach is called the false discovery rate (FDR) [11]. FDR is essentially the expected ratio \(Q\) of false positive tests \(V\) to the total number of observed positive tests \(R\):
$$FDR = E[Q] = E[V/R]$$
Typical FDR methods sort all the p-values in ascending order and compare them to the values expected when the null hypothesis is true for all voxels. The Benjamini-Hochberg approach, for instance, looks for the highest index \(k\) where:
$$P_k \leq \frac{k}{N_v} \alpha$$
and considers all tests from \(1\) to \(k\) as significant at \(\alpha\).
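Here's a minimal Benjamini-Hochberg sketch (my own implementation, assuming independent tests):

```python
import numpy as np

def fdr_bh(p_values, alpha=0.05):
    """Benjamini-Hochberg: flag tests significant at false discovery rate alpha."""
    p = np.asarray(p_values)
    n = p.size
    order = np.argsort(p)                        # sort p values ascending
    thresh = alpha * np.arange(1, n + 1) / n     # the line (k / N_v) * alpha
    below = p[order] <= thresh
    significant = np.zeros(n, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])         # largest k passing the bound
        significant[order[:k + 1]] = True        # tests 1..k are significant
    return significant
```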
FDR is far less conservative than FWER approaches, because it loosens its threshold as a function of \(k\), instead of keeping this threshold fixed. As shown below, FDR (green) also correctly excludes false positives almost as well as Bonferroni (orange): the blue shaded area shows the 95% confidence interval for 1000 randomly-generated, sorted sets of p-values of size \(N_v\) (dots show the mean). A false positive is encountered when this area falls below the respective threshold line. Purple dots, on the other hand, represent p-values that are ordinarily highly significant (on the order of \(10^{-4}\)). These values easily survive thresholding at \(\alpha=0.05\) for both FDR and FWER with a small number of tests (\(N_v=20\), left), but with a high number (\(N_v=2000\), right), only three pass FWER correction, while all pass FDR (because \(P_{k=20}\) falls just below the line).
A third approach to this problem leverages the fact that fMRI data is spatially smooth, or in other words has spatial dependence across neighbouring voxels. This smoothness implies that we are not dealing with \(N_v\) independent tests, but rather a much smaller number of neural entities that are subsampled by voxels. This approach, called random field theory (RFT), treats a brain image as a 3D Gaussian random field, characterised by a smoothness factor called the full width at half maximum (FWHM).
The main idea behind RFT is that values in a smooth statistical map form clusters (a.k.a. "blobs"), which can be thought of as single entities of interest (i.e., a population of neurons generating an aggregate BOLD signal). Our inference therefore switches from individual voxels to individual clusters, of which there are likely considerably fewer.
Clusters are obtained when we threshold a statistical map (e.g., by setting all voxels where \(t< t_{crit}\) to zero). The number of distinct clusters created by thresholding defines its Euler characteristic (EC) [12]. If we can determine this property for a random field generated under the null distribution (where no actual effects are present), we can then use the expected number of false positive clusters as an estimate of FWE (this only holds when the threshold is sufficiently strict):
$$P_{fwe} \approx E[EC]$$
The procedure for RFT inference on an fMRI statistical map has a number of steps. Firstly, the smoothness (FWHM) must be estimated from the data, which is done using the residual errors [13]. Secondly, the FWHM is used to define the number of resolution elements (resels), which is simply how many blobs of size FWHM fit in the image. Thirdly, resels are used to compute the threshold \(t_{crit}\) at which \(E[EC] \approx P_{fwe} =\alpha\). Finally, we obtain our significant clusters by thresholding our statistical map at \(t_{crit}\).
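The cluster-counting idea can be checked by simulation: generate smooth random fields under the null, threshold them, and count clusters. Here's a sketch using scipy.ndimage, with a 2D field standing in for the 3D brain volume:

```python
import numpy as np
from scipy.ndimage import gaussian_filter, label

rng = np.random.default_rng(0)

def expected_ec(threshold, fwhm=8.0, size=100, n_sims=200):
    """Approximate E[EC] at a threshold by counting clusters in smooth null fields."""
    sigma = fwhm / 2.355                    # convert FWHM to Gaussian sigma
    counts = []
    for _ in range(n_sims):
        field = gaussian_filter(rng.standard_normal((size, size)), sigma)
        field /= field.std()                # re-standardize after smoothing
        _, n_clusters = label(field > threshold)
        counts.append(n_clusters)
    return np.mean(counts)
```

In the full RFT procedure, \(E[EC]\) is computed analytically from the resel count rather than by simulation, but sweeping the threshold upward until `expected_ec` falls below \(\alpha/2\) gives the same flavour of \(t_{crit}\).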
The image plots below illustrate the RFT approach:
The top row shows a randomly generated \(100 \times 100\) matrix, smoothed with different values of FWHM. The values are standardized, meaning they approximate t-values for sufficiently large samples. The associated resels are also shown. The middle row shows these matrices thresholded at \(t>2.5\); notably (and hopefully intuitively), smoother images produce fewer but larger clusters.
The bottom row shows the distribution of EC across different thresholds. The right tail of this distribution can be used to estimate the probability of producing at least one cluster from a random matrix — or in other words, the probability of finding at least one false positive cluster (FWER for clusters). Where \(E[EC]=1\) (blue line), we expect one false positive cluster, on average, under the null hypothesis. Where \(E[EC]=\alpha\), we define our critical t value, \(t_{crit}\) (green line). (Note: because our t test is two-tailed, this value is actually defined at \(E[EC]=\alpha/2=0.025\).)
Statistical inference performed using this RFT approach is referred to as cluster-wise inference. We can conclude that clusters formed at a threshold of \(t_{crit}\) or higher would arise by chance only 5% of the time; rejecting the null hypothesis for these clusters thus controls the familywise error rate at 0.05.
Wrapping up
Phew, that was an armload of information!
If you've made it this far, you should have a fairly decent understanding of what's involved in designing and analyzing a trial-based fMRI experiment. We've covered a lot of ground, but a real understanding of these concepts comes from applying them to real data sets. The following resources can help get you started:
Have fun!
NOTE: Matlab code for the stuff above, including plots and the balloon model, is available here.
More accurately, a gamma distribution function.
A "free" parameter is one that is free to vary as we try and fit it to empirical data. A gamma function actually has two parameters, representing mean and variance; but here, following Friston et al. (1998), I am actually using a Poisson function, which is a special case of the gamma function where its mean and variance are equal; thus, one free parameter.
See this discussion for the gritty details. This page is also useful.
Technically, this is only true if we first deal with temporal autocorrelation induced by the hemodynamic response; I'm neglecting that detail here. For further info, see: Olszowy et al., 2019 and Mumford & Nichols, 2006
Details here
In the usual case, for a linear regression, our degrees of freedom would be \(N_t-3\), where \(N_t\) is the number of time points. However, because our time series data is smooth (i.e., the HRF is longer than the sample rate), neighbouring samples will not be independent of one another. Our degrees of freedom are therefore fewer than \(N_t-3\). We can use the autocorrelation function to estimate our "effective" degrees of freedom. See this reference for a solution.
Note, this assumes that within-participant variance is homogeneous across participants; see Beckmann et al., 2003
Not to be confused with false positive rate... I know, stats is a bitch
Roughly. See this paper for details.
See Kiebel et al., Neuroimage, 1999. DOI: 10.1006/nimg.1999.0508