Andrew Reid - Ph.D.

Linear regression: dealing with skewed data
Published on 2020-11-17 by Andrew Reid	#13

The skewed regression

In this blog post, I will focus on a pre-print article (available here) that purports to show that hours of video game usage is associated with higher subjective well-being. This manuscript was recently featured in this BBC article (despite its not yet being peer reviewed), with a headline suggesting that video games are "good for well-being".

This generalization has a number of interesting aspects (including the fact that the game was specifically a "social" type game involving interactions with other players, and the fact that data were collected in August 2020, 5 months into the COVID-19 pandemic), but these are not the focus of this post. Additionally, my goal here is not to pick on the authors of this particular study, but rather to explore two issues it raises in further depth than I have attempted before.

These issues are:

Large sample sizes can produce significant results even when effect sizes are minuscule (and largely uninterpretable)
Skewed distributions can produce a result that is biased towards data points in the heavy tail

The analysis in question, based on ~6000 participants, is portrayed in Figure 3 of the pre-print. Hours of games played in a two-week window is plotted on the X axis, and subjective well-being is plotted on the Y axis. The immediate concern is that the data points show a strong positive skew: sparse data points corresponding to people with extremely high gameplay hours (50+) appear to be driving the upward slope of the regression line.

This is concerning, and what is more concerning is that the regression itself, while significant, has an \(R^2=0.01\), indicating that only 1% of the variance is shared between these two variables. This is scant evidence on which to conclude that the observed relationship generalizes to the wider population. Another concern is a floor effect for gaming hours: many participants have zero hours, indicating that they did not play at all during the measurement window, and should likely be removed from this analysis.
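
To put rough numbers on this (my own back-of-the-envelope arithmetic, assuming exactly \(n=6000\) participants and a single predictor), the t statistic for the slope of a simple regression can be written in terms of \(R^2\):

\[ t=\sqrt{\frac{R^2(n-2)}{1-R^2}}=\sqrt{\frac{0.01\times 5998}{0.99}}\approx 7.8 \]

Rearranging for the smallest \(R^2\) that reaches \(p<0.05\) (two-tailed, taking \(|t|\approx 1.96\) as the critical value for a sample this large):

\[ R^2_{min}=\frac{t^2}{t^2+n-2}=\frac{1.96^2}{1.96^2+5998}\approx 0.0006 \]

In other words, at this sample size an association explaining well under a tenth of a percent of the variance would already pass the conventional threshold.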

Simulating the data

Matlab code for what follows is available here.

A great way to delve into these sorts of situations is to simulate your data. This allows you to control the factors you are interested in, such as skewness and effect size, and see how these influence the resulting statistics. A simple way to simulate the skewed distribution from Figure 3 (specifically, the red graph showing results for "Animal Crossing") is shown below. Here I've generated two sets of random Gaussian variables: one to simulate the vast majority of participants who played between 0 and 20 hours per week (blue), and another to simulate the players with more extreme levels of gameplay (green).

This bears at least some resemblance to the data of Figure 3. Importantly, this simulation approach allows us to vary the sparsity of the green data points as a percentage of the blue ones — here I've chosen 5%. This translates closely to the proportion of "extreme" values — if we define this as \(X>20\) above, we have roughly 4.5% extreme values.

We can also define the offset between the mean of our green and blue distributions. This offset will influence the correlation between our two variables.
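
The two-group setup described above can be sketched in a few lines. This is a Python illustration, not the Matlab code linked above, and every distribution parameter in it (means and standard deviations of hours and well-being, the flooring of hours at zero, the function name `simulate_skewed`) is my own guess chosen only to give a similar-looking picture.

```python
import random


def simulate_skewed(n_main=6000, extreme_frac=0.05, offset=1.5, seed=1):
    """Return (x, y): simulated gaming hours and well-being scores.

    All numeric parameters below are illustrative assumptions.
    """
    rng = random.Random(seed)
    n_extreme = round(n_main * extreme_frac)
    x, y = [], []
    # Main body of participants ("blue"): modest hours, well-being centred on 0.
    for _ in range(n_main):
        x.append(max(0.0, rng.gauss(7.0, 4.0)))   # hours are floored at zero
        y.append(rng.gauss(0.0, 1.67))
    # Sparse heavy tail ("green"): far more hours, mean well-being shifted by `offset`.
    for _ in range(n_extreme):
        x.append(max(0.0, rng.gauss(35.0, 15.0)))
        y.append(rng.gauss(offset, 1.67))
    return x, y


x, y = simulate_skewed()
share_extreme = sum(1 for v in x if v > 20) / len(x)
print(f"n = {len(x)}, share of points with X > 20: {share_extreme:.1%}")
```

Setting `offset=0` gives the "no association" case, while `extreme_frac` controls how sparse the heavy tail is relative to the main group.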

It should be clear where this is going: we should be able to see whether, and to what extent, the relatively few extreme data points will influence our resulting statistics. This simulation will also allow us to see the influence of applying a data transformation, such as a logarithmic function, on these statistics.

Extreme values disproportionately influence regression results

The image above shows some simulated data, with about 4.5% of the data points being defined as "extreme", where \(X>20\) hours (shaded areas are 95% confidence intervals). The grey plots at left show the distribution and corresponding regression model when our extreme distribution is not offset from the main one — as expected, there is no association evident. The red plots show a positive offset of 1.5. While these are not the actual data points of the study in question, they look similar enough — and the positive offset produces a similar statistical result of \(R^2=0.01\).

This plot tells us two things:

A shift in the extreme data points, which make up only 4.5% of the sample, changed the result from nonsignificant to significant, and produced a result similar to that reported in the article
A log transformation of the data, to remove the substantial positive skew, reduces this effect

Amazingly, the regression models for the log-transformed data still produce significant results at the conventional \(\alpha=0.05\) threshold, but with greatly reduced effect size (about a third of that for the skewed data). This shows just how easily one can obtain a "significant" result from a large sample, and how important it is to go beyond the p-value threshold and consider the actual effect size when interpreting such models.
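
A comparison of this kind can be reproduced with a short Python sketch (again not the original Matlab code). The simulation parameters are assumptions, the transform `log(x + 1)` is my choice so that zero-hour players stay defined, and the p-value uses a normal approximation to the t distribution, which is only reasonable because the sample is in the thousands.

```python
import math
import random
from statistics import NormalDist


def simulate_skewed(n_main=6000, extreme_frac=0.05, offset=1.5, seed=1):
    """Two-group mixture; all distribution parameters are illustrative guesses."""
    rng = random.Random(seed)
    groups = [(n_main, 7.0, 4.0, 0.0),
              (round(n_main * extreme_frac), 35.0, 15.0, offset)]
    x, y = [], []
    for n, x_mean, x_sd, y_mean in groups:
        for _ in range(n):
            x.append(max(0.0, rng.gauss(x_mean, x_sd)))  # hours cannot be negative
            y.append(rng.gauss(y_mean, 1.67))
    return x, y


def regress(x, y):
    """Simple linear regression of y on x: return (R^2, approximate two-sided p)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    r2 = sxy ** 2 / (sxx * syy)
    t = math.sqrt(r2 * (n - 2) / (1.0 - r2))
    p = 2.0 * (1.0 - NormalDist().cdf(t))  # normal approximation, large n only
    return r2, p


for offset in (0.0, 1.5):
    x, y = simulate_skewed(offset=offset)
    log_x = [math.log(v + 1.0) for v in x]  # +1 keeps zero hours defined
    r2_raw, p_raw = regress(x, y)
    r2_log, p_log = regress(log_x, y)
    print(f"offset={offset}: raw R^2={r2_raw:.4f} (p={p_raw:.2g}), "
          f"log R^2={r2_log:.4f} (p={p_log:.2g})")
```

The exact numbers depend on the assumed parameters and the random seed, so they should not be expected to match the figures in this post.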

Exploring proportion and offset of skewed data points

Both the number of data points in the heavy tail (as a proportion of total points) and the magnitude of the mean offsets will influence the statistical results. These relationships are explored in the two sets of plots below. Here, I generated 50 random skewed data sets to get a better idea of how these effects generalize, and varied either the proportion of data points in the skewed set, or the magnitude of the mean offset.

The top plot tells us that it is possible to obtain a model with \(p<0.05\) with a very small percentage (around 2.5%) of skewed data points. While this is shifted only slightly by log-transforming the data, this step reduces the \(R^2\) value by about half. This basically shows that passing a significance test is trivial with such a large data set.

The bottom plot tells us that we can also obtain a significant result with a relatively small shift in the mean of skewed data points (+0.5, equivalent to 0.3 standard deviations) from the mean of the main body of points. Again, the log transformation reduces this effect.
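
A repeated-simulation sweep along these lines might look like the following Python sketch. It is not the original analysis: the simulation parameters, the `log(x + 1)` transform, the helper names, and the use of \(|t|>1.96\) as the significance cut-off are all assumptions made for illustration.

```python
import math
import random


def simulate_skewed(n_main=6000, extreme_frac=0.05, offset=1.5, seed=1):
    """Two-group mixture; all distribution parameters are illustrative guesses."""
    rng = random.Random(seed)
    groups = [(n_main, 7.0, 4.0, 0.0),
              (round(n_main * extreme_frac), 35.0, 15.0, offset)]
    x, y = [], []
    for n, x_mean, x_sd, y_mean in groups:
        for _ in range(n):
            x.append(max(0.0, rng.gauss(x_mean, x_sd)))
            y.append(rng.gauss(y_mean, 1.67))
    return x, y


def r_squared(x, y):
    """Squared Pearson correlation, equal to R^2 for a one-predictor regression."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy ** 2 / (sxx * syy)


def sweep(extreme_frac, offset, n_reps=50, n_main=6000, t_crit=1.96):
    """Mean R^2 and share of 'significant' fits over n_reps random data sets."""
    totals = {"r2_raw": 0.0, "r2_log": 0.0, "sig_raw": 0, "sig_log": 0}
    for rep in range(n_reps):
        x, y = simulate_skewed(n_main, extreme_frac, offset, seed=rep)
        n = len(x)
        for key, xs in (("raw", x), ("log", [math.log(v + 1.0) for v in x])):
            r2 = r_squared(xs, y)
            t = math.sqrt(r2 * (n - 2) / (1.0 - r2))
            totals["r2_" + key] += r2
            totals["sig_" + key] += t > t_crit  # large-sample approximation
    return {k: v / n_reps for k, v in totals.items()}


# Vary either the offset or the proportion of points in the heavy tail.
for frac, offset in [(0.05, 0.0), (0.05, 0.5), (0.025, 1.5)]:
    res = sweep(frac, offset)
    print(f"frac={frac:.3f} offset={offset}: "
          f"significant raw={res['sig_raw']:.2f} log={res['sig_log']:.2f}, "
          f"mean R^2 raw={res['r2_raw']:.4f} log={res['r2_log']:.4f}")
```

Each call to `sweep` averages over 50 data sets by default, mirroring the number used here; finer grids of `extreme_frac` or `offset` values can be looped over in the same way.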

Determining the degree of influence of skewed data points

Finally, we can estimate the degree to which each data point influences the regression result by comparing the result to that of a model where that data point is removed. I've done this in two ways: by computing the change in effect size, \( \Delta R^2\), and by computing Cook's distance, which is obtained from the equation:

\[ D_i=\frac{\sum_{j=1}^{n}{(\hat{y}_j - \hat{y}_{j(i)})^2}}{ps^2} \]

Here, \(\hat{y}_{j(i)}\) refers to the predicted value of data point \(j\) when data point \(i\) is removed from the model, \(p=1\) (the number of predictor variables), and \( s^2=\frac{\mathbf{e}^T \mathbf{e}}{n-p}\) is the mean squared error, where \(\mathbf{e}\) is the vector of residuals from the full model.

These plots show very clearly the disproportionate influence of extreme data points on the statistical outcomes of the regression analysis: participants with 50 or more hours will have a substantially greater influence on the result than those with fewer hours. This influence is drastically reduced when we apply a log transformation.
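
Both influence measures can be computed by brute force, refitting the model once per left-out point, as in the Python sketch below. It is an illustration under stated assumptions rather than the original Matlab code: the data set is deliberately small (420 simulated points) to keep the refits cheap, the simulation parameters and `log(x + 1)` transform are guesses, \(\Delta R^2\) is taken as full-model \(R^2\) minus leave-one-out \(R^2\), and `p=1` follows the text above (many textbooks count the intercept as well and use `p=2`, which rescales the values but does not change their ranking).

```python
import math
import random


def simulate_skewed(n_main=400, extreme_frac=0.05, offset=1.5, seed=1):
    """Small two-group mixture; all distribution parameters are illustrative guesses."""
    rng = random.Random(seed)
    groups = [(n_main, 7.0, 4.0, 0.0),
              (round(n_main * extreme_frac), 35.0, 15.0, offset)]
    x, y = [], []
    for n, x_mean, x_sd, y_mean in groups:
        for _ in range(n):
            x.append(max(0.0, rng.gauss(x_mean, x_sd)))
            y.append(rng.gauss(y_mean, 1.67))
    return x, y


def fit(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((u - mx) * (v - my) for u, v in zip(x, y)) / sum((u - mx) ** 2 for u in x)
    return my - b * mx, b


def r_squared(x, y):
    a, b = fit(x, y)
    my = sum(y) / len(y)
    ss_res = sum((v - (a + b * u)) ** 2 for u, v in zip(x, y))
    return 1.0 - ss_res / sum((v - my) ** 2 for v in y)


def influence(x, y, p=1):
    """Leave-one-out influence: list of (Cook's D_i, delta R^2_i), one per point."""
    n = len(x)
    a, b = fit(x, y)
    s2 = sum((v - (a + b * u)) ** 2 for u, v in zip(x, y)) / (n - p)
    r2_full = r_squared(x, y)
    out = []
    for i in range(n):
        xi, yi = x[:i] + x[i + 1:], y[:i] + y[i + 1:]
        ai, bi = fit(xi, yi)
        # Squared change in fitted value, summed over all n points.
        d = sum(((a + b * u) - (ai + bi * u)) ** 2 for u in x) / (p * s2)
        out.append((d, r2_full - r_squared(xi, yi)))
    return out


def mean_by_group(values, flags):
    hi = [v for v, f in zip(values, flags) if f]
    lo = [v for v, f in zip(values, flags) if not f]
    return sum(hi) / len(hi), sum(lo) / len(lo)


x, y = simulate_skewed()
extreme = [v > 20 for v in x]  # groups are always defined on raw hours
for label, xs in (("raw", x), ("log", [math.log(v + 1.0) for v in x])):
    cooks = [d for d, _ in influence(xs, y)]
    d_hi, d_lo = mean_by_group(cooks, extreme)
    print(f"{label}: mean Cook's D for X > 20 = {d_hi:.5f}, for X <= 20 = {d_lo:.5f}")
```

For larger samples, the refit loop can be replaced by the closed-form expression based on residuals and leverage, which gives the same Cook's distance without refitting.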

Summing up

So how do we distill the above considerations? We can see that the heavy tails of skewed distributions can disproportionately influence the regression outcome, but is this necessarily a problem? Arguably, these data points represent the more extreme end of the scale, and because they are apparently more sparse, why shouldn't they have more influence on the regression line?

However, if we are attempting to make inferences about the general population, it is unlikely that biasing the result towards relatively few outliers supports this inference. There may be (and likely are) attributes unique to these individuals that differentially affect subjective well-being compared to individuals with less extreme levels of gameplay (what conditions are necessary to support 25+ hours per week, for example?). It is also possible that this disproportionate influence could be masking effects in the general population.

The above also highlights the utility of an appropriate data transformation, like a log transformation, for reducing the bias imposed by the skewness of a distribution. Log transforms both reduced the apparent effect size and squashed the disproportionate influence of extreme data points, by ensuring the distribution was more uniform across the range of the predictor variable.

Possibly the most important take-home here, though, is to emphasize the size of the effect rather than whether the p-value falls below some arbitrary threshold like \( \alpha=0.05 \). An \(R^2\) value of 0.01 represents a very small effect size: regardless of the regression coefficient, it means that gameplay time explains only 1% of the variance of subjective well-being. That this effect is likely inflated by the skewness only exacerbates the issue.

To conclude on a positive note, however: the findings are still intriguing, and highlight some interesting questions, such as: what is special about individuals who play video games for 25+ hours per week? Are these individuals using the game as a surrogate for social interaction? Do they substantially differ in socioeconomic status? Is this effect a result of the current COVID-19 lockdown, i.e., by providing a way to stay engaged and connected despite being largely confined to one's home?

The authors consider a number of additional factors in order to better elucidate the observed patterns, are careful to acknowledge the small effect size, and emphasize that they "cannot claim that game time causally affects well-being" based on the presented evidence. More generally, the use of large data sets such as this, particularly with a larger variety of game types, may help us understand how this activity influences mental health, whether and under what conditions it constitutes a form of addiction, and how public policy might be designed to deal with it.

One important caveat when working with large datasets is that you can almost always produce a statistically significant result when performing a null hypothesis test. This is why it is even more critical to evaluate the effect size than the p value in such an analysis. It is equally important to consider the distribution of your data, and its implications for statistical inference. In this blog post, I use simulated data in order to explore this caveat more intuitively, focusing on a pre-print article that was recently featured on BBC.

Tags:Linear regression · Correlation · Skewness · Stats

Causal discovery: An introduction
Published on 2024-09-23 by Andrew Reid	#21

This post continues my exploration of causal inference, focusing on the type of problem an empirical researcher is most familiar with: where the underlying causal model is not known. In this case, the model must be discovered. I use some Python code to introduce the PC algorithm, one of the original and most popular approaches to causal discovery. I also discuss its assumptions and limitations, and briefly outline some more recent approaches. This is part of a line of teaching-oriented posts aimed at explaining fundamental concepts in statistics, neuroscience, and psychology.

Tags:Stats · Causality · Causal inference · Causal discovery · Graph theory · Teaching

Causal inference: An introduction
Published on 2023-07-17 by Andrew Reid	#20

In this post, I attempt (as a non-expert enthusiast) to provide a gentle introduction to the central concepts underlying causal inference. What is causal inference and why do we need it? How can we represent our causal reasoning in graphical form, and how does this enable us to apply graph theory to simplify our calculations? How do we deal with unobserved confounders? This is part of a line of teaching-oriented posts aimed at explaining fundamental concepts in statistics, neuroscience, and psychology.

Tags:Stats · Causality · Causal inference · Graph theory · Teaching

Multiple linear regression: short videos
Published on 2022-08-10 by Andrew Reid	#19

In a previous series of posts, I discussed simple and multiple linear regression (MLR) approaches, with the aid of interactive 2D and 3D plots and a bit of math. In this post, I am sharing a series of short videos aimed at psychology undergraduates, each explaining different aspects of MLR in more detail. The goal of these videos (which formed part of my second-year undergraduate module) is to give a little more depth to fundamental concepts that many students struggle with. This is part of a line of teaching-oriented posts aimed at explaining fundamental concepts in statistics, neuroscience, and psychology.

Tags:Stats · Linear regression · Teaching

Learning about multiple linear regression
Published on 2021-12-30 by Andrew Reid	#18

In this post, I explore multiple linear regression, generalizing from the simple two-variable case to three- and many-variable cases. This includes an interactive 3D plot of a regression plane and a discussion of statistical inference and overfitting. This is part of a line of teaching-oriented posts aimed at explaining fundamental concepts in statistics, neuroscience, and psychology.

Tags:Stats · Linear regression · Teaching

Learning about fMRI analysis
Published on 2021-06-24 by Andrew Reid	#17

In this post, I focus on the logic underlying statistical inference based on fMRI research designs. This consists of (1) modelling the hemodynamic response; (2) "first-level" within-subject analysis of time series; (3) "second-level" population inferences drawn from a random sample of participants; and (4) dealing with familywise error. This is part of a line of teaching-oriented posts aimed at explaining fundamental concepts in statistics, neuroscience, and psychology.

Tags:Stats · FMRI · Hemodynamic response · Mixed-effects model · Random field theory · False discovery rate · Teaching

Learning about simple linear regression
Published on 2021-03-25 by Andrew Reid	#16

In this post, I introduce the concept of simple linear regression, where we evaluate how well a linear model approximates a relationship between two variables of interest, and how to perform statistical inference on this model. This is part of a line of teaching-oriented posts aimed at explaining fundamental concepts in statistics, neuroscience, and psychology.

Tags:Stats · Linear regression · F distribution · Teaching

New preprint: Tract-specific statistics from diffusion MRI
Published on 2021-03-05 by Andrew Reid	#15

In our new preprint, we describe a novel methodology for (1) identifying the most probable "core" tract trajectory for two arbitrary brain regions, and (2) estimating tract-specific anisotropy (TSA) at all points along this trajectory. We describe the outcomes of regressing this TSA metric against participants' age and sex. Our hope is that this new method can serve as a complement to the popular TBSS approach, where researchers desire to investigate effects specific to a pre-established set of ROIs.

Tags:Diffusion-weighted imaging · Tractography · Connectivity · MRI · News

Learning about correlation and partial correlation
Published on 2021-02-04 by Andrew Reid	#14

This is the first of a line of teaching-oriented posts aimed at explaining fundamental concepts in statistics, neuroscience, and psychology. In this post, I will try to provide an intuitive explanation of (1) the Pearson correlation coefficient, (2) confounding, and (3) how partial correlations can be used to address confounding.

Tags:Stats · Linear regression · Correlation · Partial correlation · Teaching

Functional connectivity as a causal concept
Published on 2019-10-14 by Andrew Reid	#12

In neuroscience, the conversation around the term "functional connectivity" can be confusing, largely due to the implicit notion that associations can map directly onto physical connections. In our recent Nature Neuroscience perspective piece, we propose the redefinition of this term as a causal inference, in order to refocus the conversation around how we investigate brain connectivity, and interpret the results of such investigations.

Tags:Connectivity · FMRI · Causality · Neuroscience · Musings

Functional connectivity? But...
Published on 2017-07-26 by Andrew Reid	#11

Functional connectivity is a term originally coined to describe statistical dependence relationships between time series. But should such a relationship really be called connectivity? Functional correlations can easily arise from networks in the complete absence of physical connectivity (i.e., the classical axon/synapse projection we know from neurobiology). In this post I elaborate on recent conversations I've had regarding the use of correlations or partial correlations to infer the presence of connections, and their use in constructing graphs for topological analyses.

Tags:Connectivity · FMRI · Graph theory · Partial correlation · Stats

Driving the Locus Coeruleus: A Presentation to Mobify
Published on 2017-07-17 by Andrew Reid	#10

How do we know when to learn, and when not to? Recently I presented my work to Vancouver-based Mobify, including the use of a driving simulation task to answer this question. They put it up on YouTube, so I thought I'd share.

Tags:Norepinephrine · Pupillometry · Mobify · Learning · Driving simulation · News

Limitless: A neuroscientist's film review
Published on 2017-03-29 by Andrew Reid	#9

In the movie Limitless, Bradley Cooper stars as a down-and-out writer who happens across a superdrug that miraculously heightens his cognitive abilities, including memory recall, creativity, language acquisition, and action planning. It apparently also makes his eyes glow with an unnerving and implausible intensity. In this blog entry, I explore this intriguing possibility from a neuroscientific perspective.

Tags:Cognition · Pharmaceuticals · Limitless · Memory · Hippocampus · Musings

The quest for the human connectome: a progress report
Published on 2016-10-29 by Andrew Reid	#8

The term "connectome" was introduced in a seminal 2005 PNAS article, as a sort of analogy to the genome. However, unlike genomics, the methods available to study human connectomics remain poorly defined and difficult to interpret. In particular, the use of diffusion-weighted imaging approaches to estimate physical connectivity is fraught with inherent limitations, which are often overlooked in the quest to publish "connectivity" findings. Here, I provide a brief commentary on these issues, and highlight a number of ways neuroscience can proceed in light of them.

Tags:Connectivity · Diffusion-weighted imaging · Probabilistic tractography · Tract tracing · Musings

New Article: Seed-based multimodal comparison of connectivity estimates
Published on 2016-06-24 by Andrew Reid	#7

Our article proposing a threshold-free method for comparing seed-based connectivity estimates was recently accepted to Brain Structure & Function. We compared two structural covariance approaches (cortical thickness and voxel-based morphometry), and two functional ones (resting-state functional MRI and meta-analytic connectivity mapping, or MACM).

Tags:Multimodal · Connectivity · Structural covariance · Resting state · MACM · News

Four New ANIMA Studies
Published on 2016-03-18 by Andrew Reid	#6

Announcing four new submissions to the ANIMA database, which brings us to 30 studies and counting. Check them out if you get the time!

Tags:ANIMA · Neuroscience · Meta-analysis · ALE · News

Exaptation: how evolution recycles neural mechanisms
Published on 2016-02-27 by Andrew Reid	#5

Exaptation refers to the tendency across evolution to recycle existing mechanisms for new and more complex functions. By analogy, this is likely how episodic memory — and indeed many of our higher level neural processes — evolved from more basic functions such as spatial navigation. Here I explore these ideas in light of the current evidence.

Tags:Hippocampus · Memory · Navigation · Exaptation · Musings

The business of academic writing
Published on 2016-02-04 by Andrew Reid	#4

Publishers of scientific articles have been slow to adapt their business models to the rapid evolution of scientific communication — mostly because there is profit in dragging their feet. I explore the past, present, and future of this important issue.

Tags:Journals · Articles · Impact factor · Citations · Business · Musings

Reflections on multivariate analyses
Published on 2016-01-15 by Andrew Reid	#3

Machine learning approaches to neuroimaging analysis offer promising solutions to research questions in cognitive neuroscience. Here I reflect on recent interactions with the developers of the Nilearn project.

Tags:MVPA · Machine learning · Nilearn · Elastic net · Statistics · Stats

New ANIMA study: Hu et al. 2015
Published on 2016-01-11 by Andrew Reid	#2

Announcing a new submission to the ANIMA database: Hu et al., Neuroscience & Biobehavioral Reviews, 2015.

Tags:ANIMA · Neuroscience · Meta-analysis · ALE · Self · News

Who Am I?
Published on 2016-01-10 by Andrew Reid	#1

Musings on who I am, where I came from, and where I'm going as a Neuroscientist.

Tags:Labels · Neuroscience · Cognition · Musings
