I've just spent two days talking machine learning concepts with some of the excellent folks behind the Nilearn Python project. We had some fruitful conversations, both about the possibilities and limitations of approaches implemented by the Nilearn package, and I wanted to get a few of these ideas down here.
Multivariate analyses
In contrast to the more classical univariate analysis (e.g., as implemented in the SPM software), multivariate analysis (also known as multivariate pattern analysis, or MVPA) treats all voxels in an image or ROI as elements in a single statistical model. So, while SPM runs a separate general linear model for each voxel independently, and subsequently deals with the multiple comparisons problem, MVPA puts all voxels into (for instance) a single GLM, and attempts to solve that in one shot. A typical problem looks like this:
\[ \mathbf{y} = \mathbf{\beta} \cdot \mathbf{X} + \varepsilon \]
In this equation, \(\mathbf{y}\) is our response vector, or the thing we want to predict. \(\mathbf{X}\) is our voxel-wise feature matrix, which can be an fMRI BOLD time series, or any other set of image intensity values we want to use to predict \(\mathbf{y}\). \(\beta\) are the model weights we are trying to fit to the data. As an example, suppose we have an experiment in which different categories of visual stimuli are presented, such as houses and faces. We acquire 2 mm resolution fMRI with 150 time points, while alternately presenting these stimuli. We want to know whether the fMRI BOLD response we observe while presenting these stimuli can be used to discriminate between them. This is known as a classification problem.
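To make this concrete, here is a minimal sketch of how \(\mathbf{X}\) and \(\mathbf{y}\) might be assembled with Nilearn. The file names ("bold.nii.gz", "labels.csv") and the "stimulus" column are placeholders for your own data, and depending on your Nilearn version, `NiftiMasker` may live in `nilearn.input_data` rather than `nilearn.maskers`.

```python
import pandas as pd
from nilearn.maskers import NiftiMasker

# Extract an (n_timepoints, n_voxels) matrix from the 4D BOLD image,
# restricted to in-brain voxels and standardized per voxel.
masker = NiftiMasker(standardize=True)
X = masker.fit_transform("bold.nii.gz")   # e.g., (150 time points, ~100,000 voxels)

# The response vector: one label ("face" or "house") per time point.
y = pd.read_csv("labels.csv")["stimulus"].values
```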
There are many approaches to classification, but our big problem is this: for each subject, we have far more features ( \(p > 100{,}000\) ) than we have time points ( \(n=150\) ). Such a problem is called ill-posed, because there are many more parameters ( \( \beta \) ) than there are observations ( \(\mathbf{y}\) ). Because of this, the usual estimation methods, such as ordinary least squares, are not applicable. Instead, we need to formulate the problem as an optimization problem, such that we are trying to minimize the error of the model prediction while also minimizing the complexity of the model. The general problem, known as linear least squares, looks like this:
\[ \mathop{\textrm{argmin}_{\beta}} ( {\left\lVert \mathbf{y} - \beta \cdot \mathbf{X} \right\rVert}_2^2 ) \]
in which we are trying to adjust the \(\beta\) weights in order to minimize the prediction error (\(\varepsilon\) in the above equation) between our observations \(\mathbf{y}\) and our model \(\beta \cdot X\), using the squared \(L_2\) norm, or Euclidean distance.
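The reason ordinary least squares breaks down here is easy to demonstrate: with far more features than observations, there are infinitely many \(\beta\) vectors that fit the training data exactly, so a perfect fit tells us nothing by itself. A toy illustration in NumPy (dimensions scaled down for speed):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 150, 10_000                      # time points vs. voxels (toy scale)
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)              # pure noise, unrelated to X

# lstsq returns the minimum-norm solution among the infinitely many
# betas that achieve (essentially) zero training error.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(X @ beta, y))         # True: a "perfect" fit to random noise
```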
Regularization
Regularization is used to constrain the \(\beta\) weights, and in particular to impose sparsity on the solution. In other words, since we assume that many, if not most, of our features (voxels) will not be discriminative for faces and houses, we can reduce the complexity of our problem by introducing a penalization term into our optimization problem. To reduce complexity, we want to minimize the overall magnitude (the norm) of the \(\beta\) weights; indeed, we would like many of them to be reduced to zero, and effectively eliminated from the model. Two well-known regularization approaches are ridge regression, using the (squared) \(L_2\) norm:
\[ \mathop{\textrm{argmin}_{\beta}} ( {\left\lVert \mathbf{y} - \beta \cdot \mathbf{X} \right\rVert}_2^2 + \lambda \cdot {\left\lVert \beta \right\rVert}_2^2 ) \]
And LASSO, using the \(L_1\) norm:
\[ \mathop{\textrm{argmin}_{\beta}} ( {\left\lVert \mathbf{y} - \beta \cdot \mathbf{X} \right\rVert}_2^2 + \lambda \cdot {\left\lVert \beta \right\rVert}_1 ) \]
In both cases, we've introduced the tuning parameter \( \lambda \), which can be used to control the degree of sparsity we want to impose. They really only differ in the type of norm operation applied. Ridge regression, which uses the \(L_2\) norm, results in a smooth change in \(\beta\) weights over values of \(\lambda\), and as a result reduces them only to very small, but non-zero values. For LASSO on the other hand, the \(L_1\) norm reduces many \(\beta\) weights to zero, and tends to retain only a few very highly predictive voxels.
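The difference is easy to see on synthetic data with scikit-learn, where the \(\lambda\) of the equations above corresponds to the estimators' `alpha` argument (the specific values here are arbitrary):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.standard_normal((150, 1000))
beta_true = np.zeros(1000)
beta_true[:10] = 2.0                            # only 10 truly informative features
y = X @ beta_true + rng.standard_normal(150)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)

print("non-zero ridge weights:", np.sum(ridge.coef_ != 0))   # typically all 1000
print("non-zero LASSO weights:", np.sum(lasso.coef_ != 0))   # typically a few dozen at most
```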
LASSO is a valuable approach in genome-wide association studies (GWAS), as it is often desirable to isolate a small subset of genetic sequences as predictive of a variable of interest. However, when applied to neuroimaging data, it will tend to retain only one voxel from a cluster of voxels with similar time series, penalizing the rest to zero. In other words, it produces solutions which are too sparse to represent realistic neuroimaging patterns, where we expect spatial smoothness. On the other hand, ridge regression can produce smooth patterns, but it does not introduce sparsity, since most \(\beta\) weights retain non-zero values.
A compromise between the two is the elastic net. It is, quite simply, a weighted average of these two regularization terms, in which the trade-off between smoothness and sparsity is expressed by the parameter \(\alpha\):
\[ \mathop{\textrm{argmin}_{\beta}} ( {\left\lVert \mathbf{y} - \beta \cdot \mathbf{X} \right\rVert}_2^2 + \lambda \cdot [ (1-\alpha) \cdot {\left\lVert \beta \right\rVert}_1 + \alpha \cdot {\left\lVert \beta \right\rVert}_2^2 ] ) \]
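Scikit-learn implements this penalty directly, although with different names: our \(\lambda\) becomes `alpha`, and the L1/L2 mix is controlled by `l1_ratio` (1.0 is pure LASSO, 0.0 is pure ridge), so beware the naming collision with the \(\alpha\) in the equation above. A minimal sketch on synthetic data:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X = rng.standard_normal((150, 1000))
y = X[:, :10].sum(axis=1) + rng.standard_normal(150)

# Equal weighting of the L1 and L2 terms; values are illustrative only.
enet = ElasticNet(alpha=0.5, l1_ratio=0.5, max_iter=5000).fit(X, y)
print("non-zero elastic net weights:", int((enet.coef_ != 0).sum()))
```

`ElasticNet` is the regression form; for a two-class problem like faces versus houses, the same penalty is available through `LogisticRegression(penalty="elasticnet", solver="saga", l1_ratio=...)`.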
An extension of the elastic net approach, called GraphNet, further manipulates the regularization term in order to explicitly impose spatial smoothness on the solution.
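Nilearn exposes GraphNet through its SpaceNet decoders. A hedged sketch, with placeholder file names and labels, and with the caveat that the exact module path and arguments may vary across Nilearn versions:

```python
import pandas as pd
from nilearn.decoding import SpaceNetClassifier

labels = pd.read_csv("labels.csv")["stimulus"].values   # one label per volume

# penalty="graph-net" adds the spatial regularization term; "tv-l1" is the
# total-variation alternative offered by the same class.
decoder = SpaceNetClassifier(penalty="graph-net", standardize=True)
decoder.fit("bold.nii.gz", labels)

# decoder.coef_img_ holds the fitted beta weights as a Nifti image.
```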
Cross-validation
Another big problem with fitting our ill-posed multivariate linear model is over-fitting, in which we find a solution that fits the training data well, but whose performance does not generalize to independent samples. Since generalizability is the goal of any empirical research, we need to address the issue of over-fitting. The standard solution is cross-validation, in which we use independent subsets of our sample, or folds, as "training" and "testing" samples. In \(k\)-fold cross-validation, we divide the sample into \(k\) folds, and for each fold we fit the elastic net model on the remaining data and then evaluate its prediction error on the held-out fold. The average of these prediction errors gives a measure of the generalizability of the model.
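With scikit-learn, \(k\)-fold cross-validation is a one-liner; the synthetic data below merely stand in for the real \(\mathbf{X}\) and \(\mathbf{y}\):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.standard_normal((150, 1000))
y = rng.integers(0, 2, size=150)               # stand-in for face/house labels

clf = LogisticRegression(max_iter=1000)
scores = cross_val_score(clf, X, y, cv=KFold(n_splits=5))
print("mean cross-validated accuracy:", scores.mean())
```

For fMRI time series, the folds are usually defined by acquisition runs or sessions (e.g., with `LeaveOneGroupOut`) rather than arbitrary splits, so that temporally autocorrelated samples do not leak between training and testing sets.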
In practice, we also want to perform this cross-validation procedure across different values of our tuning parameters \(\alpha\) and \(\lambda\). This is called nested cross-validation, and comparing prediction errors across many values of these parameters allows us to select the best model for our classification purpose.
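A nested scheme can be sketched by placing a grid search inside the outer cross-validation loop; the estimator and the parameter grids below are illustrative, not recommendations:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.standard_normal((150, 1000))
y = rng.integers(0, 2, size=150)

# Inner loop: choose the tuning parameters by grid search.
inner = GridSearchCV(
    SGDClassifier(loss="log_loss", penalty="elasticnet", max_iter=2000),
    param_grid={"alpha": [1e-4, 1e-3, 1e-2],      # plays the role of lambda
                "l1_ratio": [0.1, 0.5, 0.9]},     # plays the role of alpha
    cv=KFold(n_splits=3),
)

# Outer loop: estimate the generalization error of the whole procedure.
outer_scores = cross_val_score(inner, X, y, cv=KFold(n_splits=5))
print("nested CV accuracy:", outer_scores.mean())
```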
Some caveats about interpretation
Say you've collected your data, run it through Nilearn's GraphNet optimization, and obtained a cross-validated predictive model that can discern whether the input was faces or houses with 90% accuracy. That's pretty cool. But as a cognitive neuroscientist, you want to dig deeper. What can this model tell us about the neural mechanisms of this discrimination? The obvious route to answering this question is to look at the \(\beta\) weights, since there is one per voxel, and higher weights clearly contribute more to the model prediction, while zero weights contribute nothing. Mapping these values back into image space gives us a weight map of discriminative voxels.
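In Nilearn, such a map can be produced by pushing the fitted weight vector back through the masker; a minimal sketch, again with placeholder file names and a plain logistic regression standing in for whichever decoder was actually fit:

```python
import pandas as pd
from nilearn import plotting
from nilearn.maskers import NiftiMasker
from sklearn.linear_model import LogisticRegression

masker = NiftiMasker(standardize=True)
X = masker.fit_transform("bold.nii.gz")                   # placeholder data
y = pd.read_csv("labels.csv")["stimulus"].values

clf = LogisticRegression(max_iter=1000).fit(X, y)

# inverse_transform puts the 1D weight vector back into 3D brain space.
weight_img = masker.inverse_transform(clf.coef_[0])
plotting.plot_stat_map(weight_img, title="discriminative weights")
```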
The trick is how to interpret this. We would like to be able to make statements such as "the yellow voxels represent regions that are activated more strongly for faces than for houses". But we cannot actually derive such an inference from our results. Why? In contrast to univariate methods, where every voxel is modelled independently, our multivariate result cannot be interpreted in terms of solitary voxels. Because each \(\beta\) is dependent on the others in the model, removing one voxel from the model will affect the optimal result for all the others, and it is the combination of information across terms that gives the model its predictive power. We can say that the voxels with high absolute \(\beta\) weights have more discriminatory power for faces versus houses, but we can't go much further on the basis of these results alone.
Some remedies for this limitation? Firstly, it is always a good idea to consider complementary analyses. For instance, while univariate models do not have the same predictive power as multivariate ones, they do have the advantage of permitting inferences on single voxels. So, if we ran a parallel univariate analysis in which we identified some of the same voxels as being more strongly activated for faces than for houses, we could present these results in combination and support more interesting inferences about the mechanisms of this perceptual function. It is also possible to use the results of the multivariate analysis to define an ROI for the univariate one, with one major caveat: to avoid the "double dipping" conundrum, we would have to do this on an independent sample (e.g., a split-half design, if your sample size is large enough for this to be feasible).
Another possibility is the use of a searchlight analysis. Here, we define spheres of a certain radius around a specific voxel, and perform our multivariate analysis only for the voxels in this sphere. This "searchlight" can then be applied to all voxels of interest, and the resulting predictive accuracy of each sub-model can be assigned to its center voxel. This approach is something of a hybrid between multi- and univariate approaches, in that it can be used to support stronger statements about individual voxels (and their immediate neighbours), while still retaining some of the power of the multivariate approach. It is justified by the assumption of smoothness in neuroimaging data, although the ideal searchlight radius is unknown (but can conceivably be estimated through model selection in a similar manner to \(\lambda\) and \(\alpha\), above).
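Nilearn also provides a searchlight implementation; a hedged sketch, with the mask, radius, and file names as placeholders, and the usual caveat that arguments may differ slightly across versions:

```python
import pandas as pd
from nilearn.decoding import SearchLight
from sklearn.model_selection import KFold

y = pd.read_csv("labels.csv")["stimulus"].values          # one label per volume

searchlight = SearchLight(
    mask_img="brain_mask.nii.gz",   # restrict sphere centers to in-brain voxels
    radius=6.0,                     # sphere radius in mm (an arbitrary choice)
    cv=KFold(n_splits=5),
)
searchlight.fit("bold.nii.gz", y)

# searchlight.scores_ is a 3D array with one cross-validated accuracy
# per sphere-center voxel.
```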
In conclusion, this method was exciting to learn about, and the Nilearn package provides a very user-friendly and well-documented means of performing these sorts of analyses on your data. I certainly plan to incorporate these methods into my future research designs.