Name | Height |
---|---|
Tyler | 181 |
Steve | 190 |
Jenny | 173 |
Cindy | 158 |
4 Statistical Models
In psychological research, a statistical model is a mathematical framework used to represent, analyze, and make predictions about data. It is often used to explain relationships between variables, such as the effect of one or more independent variables (predictors) on a dependent variable (outcome). Statistical models help researchers to:
1. Summarize Data
They provide a simplified representation of complex data, making it easier to understand and interpret patterns or trends.
2. Test Hypotheses
Researchers use statistical models to evaluate whether there are significant relationships between variables, or whether observed patterns in the data could have occurred by chance.
3. Make Predictions
Based on the relationships identified in the data, models can predict future outcomes or behaviors.
When we use theory and generate hypotheses, we can translate our hypotheses into statistical models to test. We are attempting to create a model of a real-world phenomenon. For example, consider the following theory proposed by Edwin Shneidman: suicide is caused by psychache (unbearable mental pain). Imagine this was truly how suicide worked in the real world. Consider the following three possible models:
- Researcher 1: as psychache increases, so does suicide risk
- Researcher 2: as social connectedness decreases, suicide risk increases
- Researcher 3: presence of gene XX4r2 causes suicide
Each researcher could collect data to test how well their hypothesis fits the data that they collect. Each hypothesis can be represented as a model that is statistically testable.
- Researcher 1:
- The correlation between \(x\) (psychache) and \(y\) (suicide risk) should be positive.
- \(r_{x, y}>0\)
- Researcher 2:
- The correlation between \(x\) (social connectedness) and \(y\) (suicide risk) should be negative
- \(r_{x, y}<0\)
- Researcher 3:
- If \(x\) occurs (someone dies by suicide), then \(y\) should also occur (the gene should be present in a biopsy)
- The proportion of people who die by suicide who also have the gene should be 100%
- \(p_{(gene|suicide)}=1.00\)
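As a rough sketch of how each hypothesis becomes a quantity that can be checked against data, the Python snippet below computes a Pearson correlation and a conditional proportion. Every number in it is invented purely for illustration (none of it is real data), and the helper name `pearson_r` is an assumption of this sketch rather than something defined in this chapter.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)

# Invented scores for five hypothetical participants (NOT real data).
psychache = [2, 4, 5, 7, 9]
connectedness = [9, 7, 6, 4, 1]
risk = [1, 3, 4, 6, 9]

r_researcher_1 = pearson_r(psychache, risk)       # hypothesis: r > 0
r_researcher_2 = pearson_r(connectedness, risk)   # hypothesis: r < 0

# Researcher 3: invented gene results (True = gene present) for five
# hypothetical biopsies; the hypothesis requires a proportion of 1.00.
gene_present = [True, True, False, True, True]
p_gene_given_suicide = sum(gene_present) / len(gene_present)

print(r_researcher_1 > 0, r_researcher_2 < 0, p_gene_given_suicide)
```

With these invented values, the first two hypotheses would be consistent with the data, while the third would not, because the proportion falls below 1.00.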
Each researcher would collect slightly different data and analyse it differently. If the model fits the data well, it provides support for the hypothesis and theory. If it does not fit the data well, it likely does not accurately represent the real-world phenomenon of interest. For example, if Researcher 3 collected genetic data and some individuals who died by suicide did not carry the hypothetical gene, the observed proportion would fall below 1.00, indicating a poor model fit. The results of statistical analyses that test your model can give an indication of fit.
Models can be complex or simple. Again, the research question and hypothesis precede the research design and, subsequently, the model.
4.1 A Basic Model
Let’s try to model the mean height of psychology professors (in centimeters). You cannot measure all the psych professors in the world. Instead, you go to the Arts and Sciences Building at Grenfell Campus and measure the heights of four of your psychology professors. You get the following data.
The average height of these professors is 175.5 cm. This mean is a model. The model can be represented as:
\(x_i = \overline{x} + e_i\)
Here, \(x_i\) represents the height of professor \(i\); \(\overline{x}\) represents the sample mean height of the professors; and \(e_i\) represents the difference between that professor's height and the mean, or the error. These errors are also sometimes referred to as residuals.
We can assess how well the model fits with the data we collected. For our model, it would make sense to try to calculate how large our \(e_i\)s are, as these represent the model error. If our model does a poor job, errors will be higher compared to a model that does a good job.
4.2 Deviations
One method to assess the quality of the fit of the model, our mean, to the data is to compare how different our data are from the model. You now know that these differences are the model errors. We can subtract the mean from each value to create a numerical representation of this fit. For example, Tyler is 181 cm tall. Our model suggests that the average height is 175.5 cm. We can calculate the deviation here as:
\(e_i = (x_i - \overline{x}) = (181 - 175.5) = 5.5\)
The following are the deviations for each individual.
Name | Deviation |
---|---|
Tyler | 5.5 |
Steve | 14.5 |
Jenny | -2.5 |
Cindy | -17.5 |
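The deviations in the table can be reproduced with a few lines of Python. This is only a minimal sketch; the variable names (`heights`, `mean_height`, `errors`) are chosen here for illustration.

```python
# Heights (cm) of the four professors measured in this chapter.
heights = {"Tyler": 181, "Steve": 190, "Jenny": 173, "Cindy": 158}

# The model: every professor is predicted to have the sample mean height.
mean_height = sum(heights.values()) / len(heights)

# e_i = x_i - mean: how far each observation falls from the model.
errors = {name: x - mean_height for name, x in heights.items()}

print(mean_height)
for name, e in errors.items():
    print(name, e)
```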
If we sum the errors across all our data, we get \(\sum{e_i}=5.5 + 14.5 + (-2.5) + (-17.5) = 0\). What? That can’t be right. Does this mean that this is the perfect model? No. In many models, including this one, the sum of the raw residuals will equal 0:
\(\sum_{i=1}^n{e_i}=0\)
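For the mean model specifically, this result follows from the definition of the sample mean, \(\overline{x} = \frac{1}{n}\sum_{i=1}^n{x_i}\), which implies \(\sum_{i=1}^n{x_i} = n\overline{x}\). A short derivation:

\(\sum_{i=1}^n{e_i} = \sum_{i=1}^n{(x_i - \overline{x})} = \sum_{i=1}^n{x_i} - n\overline{x} = n\overline{x} - n\overline{x} = 0\)

Positive and negative errors cancel exactly, so the raw sum cannot tell us how well the model fits.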
There is a way to bypass this statistical conundrum.
4.3 Variance and Standard Deviation
We may effectively model the fit of our mean model with the variance and standard deviation. These are extremely important in statistics so it’s imperative to become familiar with them (see a previous chapter where these are explained in detail).
Above we calculated the deviation of each score. The variance is, in essence, the average squared difference between a score and its mean. For a population with mean \(\mu\), it is:
\(\sigma^2 = {\frac{\sum\limits_{i = 1}^N {\left( {x_i - \mu} \right)^2 }}{N} }\)
But for a sample, our equation is (see last chapter for the rationale):
\(s^2 = {\frac{\sum\limits_{i = 1}^N {\left( {x_i - \bar x} \right)^2 }}{N-1} }\)
This equation simply means we add up all the squared differences between a score and the mean and divide by the number of scores minus one. So, the squared deviations are:
Name | Deviation | Squared |
---|---|---|
Tyler | 5.5 | 30.25 |
Steve | 14.5 | 210.25 |
Jenny | -2.5 | 6.25 |
Cindy | -17.5 | 306.25 |
We then add up the squared deviations, \(30.25+210.25+6.25+306.25=553\), and divide by the number of scores (with the sample adjustment to \(N-1\)), \(4-1=3\), to get:
\(s^2 = {\frac{\sum\limits_{i = 1}^N {\left( {x_i - \bar x} \right)^2 }}{N-1} } = \frac{30.25+210.25+6.25+306.25}{4-1} = \frac{553}{3}=184.33\)
Thus, the variance of the heights of psychology professors is \(184.33\). The standard deviation is simply the square root of the variance:
\(s = \sqrt{{\frac{\sum\limits_{i = 1}^N {\left( {x_i - \bar x} \right)^2 }}{N-1} }}=\sqrt{184.33}=13.58\)
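The same arithmetic can be checked in Python. The sketch below computes the sample variance and standard deviation by hand and then compares them with the standard library's `statistics.variance` and `statistics.stdev`, both of which use the \(N-1\) denominator. Variable names are chosen here for illustration.

```python
import statistics

heights = [181, 190, 173, 158]  # professor heights in cm, from this chapter

mean_height = statistics.mean(heights)
squared_devs = [(x - mean_height) ** 2 for x in heights]

# Sample variance: sum of squared deviations divided by N - 1.
sample_variance = sum(squared_devs) / (len(heights) - 1)
sample_sd = sample_variance ** 0.5

# The standard library applies the same N - 1 formulas.
assert abs(sample_variance - statistics.variance(heights)) < 1e-9
assert abs(sample_sd - statistics.stdev(heights)) < 1e-9

print(round(sample_variance, 2), round(sample_sd, 2))  # 184.33 13.58
```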
While you might think that the standard deviation (SD) is the average absolute difference between a score and the mean, it is not. For example, the SD of our heights is 13.58. But the average absolute deviation is, in fact, \(\frac{|5.5| + |14.5| + |-2.5| + |-17.5|}{4} = \frac{40}{4} = 10\). It is most likely helpful to think of the variance as the average squared deviation and the SD as the square root of the variance.
Instead of using the mean in the above model, use a value of 190cm. Our new model would be:
\(x_i = 190 + e_i\)
Calculate the errors, variance and SD using this new model. Was the variance higher, the same or lower?
Which model seemed better? The one using the mean or 190cm?
Name | Height | NewDeviation | NewSquaredDeviation |
---|---|---|---|
Tyler | 181 | -9 | 81 |
Steve | 190 | 0 | 0 |
Jenny | 173 | -17 | 289 |
Cindy | 158 | -32 | 1024 |
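The sum of these new squared deviations is 1394.
The variance of these is 464.67.
The SD is 21.56.
Both values are larger than under the mean model (184.33 and 13.58), so the mean fits these data better than 190 cm does.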
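The comparison between the two models can be sketched in Python as follows. The helper `fit_summary` is a hypothetical name introduced for this example; it applies the same \(N-1\) calculation used above to whatever value the model proposes.

```python
heights = [181, 190, 173, 158]  # professor heights in cm, from this chapter

def fit_summary(data, model_value):
    """Return (sum of squared errors, variance with N - 1, SD) for the
    model x_i = model_value + e_i."""
    sse = sum((x - model_value) ** 2 for x in data)
    variance = sse / (len(data) - 1)
    return sse, variance, variance ** 0.5

mean_height = sum(heights) / len(heights)

sse_mean, var_mean, sd_mean = fit_summary(heights, mean_height)
sse_190, var_190, sd_190 = fit_summary(heights, 190)

print(sse_mean, round(var_mean, 2), round(sd_mean, 2))
print(sse_190, round(var_190, 2), round(sd_190, 2))
```

The model with the smaller sum of squared errors is the one that sits closer to the observed heights.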
4.4 Advanced Models
While above we have simply modeled a mean, later chapters will build up to more advanced models, such as:
\(y_i=\beta_0+x_{1i}\beta_1+x_{2i}\beta_2+x_{3i}\beta_3+e_i\)
Don’t be intimidated. This is a whole lot like your classic high school \(y=mx+b\), with an intercept and some slopes. More to come. For now, a brief overview of some potential models will do. I will note that many common statistical models fall under the broader umbrella of general linear models (GLM). Some common types of statistical models in psychological research include:
1. Linear regression models
These assess the relationship between one or more independent variables and a continuous dependent variable. For example, predicting levels of anxiety based on hours of sleep.
2. ANOVA (Analysis of Variance)
This model compares the means of different groups to determine if they are significantly different from one another, often used in experimental studies.
3. Structural equation modeling (SEM)
SEM is a more complex statistical model that can evaluate multiple relationships between variables simultaneously, including latent (unmeasured) variables.
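To make the multi-predictor equation above a little more concrete, here is a minimal Python sketch of how a prediction and an error term would be computed for one person. The intercept, slopes, predictor values, and observed outcome are all hypothetical numbers chosen only for illustration, and `predict` is a name introduced for this example.

```python
def predict(b0, betas, xs):
    """Predicted outcome: b0 + x1*b1 + x2*b2 + ... for one individual."""
    return b0 + sum(x * b for x, b in zip(xs, betas))

# Hypothetical intercept and slopes (not estimated from any real data).
b0 = 10.0
betas = [0.5, -1.0, 2.0]

# Hypothetical predictor values and observed outcome for one person.
xs = [4.0, 3.0, 1.5]
y_observed = 13.0

y_predicted = predict(b0, betas, xs)
e = y_observed - y_predicted  # the error term e_i for this person

print(y_predicted, e)
```

Just as with the mean model, the error is the gap between what was observed and what the model predicts.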
Statistical models are essential for drawing conclusions about psychological phenomena, helping researchers identify patterns, test theoretical models, and inform practice. We will revisit the idea of models throughout each chapter that follows.
- Calculate the mean, variance, and standard deviation for both the height (in cm) and weight (in kg) of these NHL players.
Player | Height | Weight |
---|---|---|
Connor McDavid | 185.4 | 99.0 |
Auston Matthews | 190.5 | 93.0 |
Sidney Crosby | 180.0 | 91.0 |
Alex Ovechkin | 191.0 | 107.9 |
- Write out the model for NHL height.
- What are the \(e_i\) values for each player when modeling their height?
Statistic | Value |
---|---|
Mean_Height | 186.725000 |
SD_Height | 5.148058 |
var_Height | 26.502500 |
Mean_Weight | 97.725000 |
SD_Weight | 7.587435 |
var_Weight | 57.569167 |
- Write out the model for NHL height.
\(height_i=\overline{x}_{height}+e_i\)
- What are the \(e_i\) values for each player when modeling their height?
Player | Height | e_i |
---|---|---|
Connor McDavid | 185.4 | -1.325 |
Auston Matthews | 190.5 | 3.775 |
Sidney Crosby | 180.0 | -6.725 |
Alex Ovechkin | 191.0 | 4.275 |
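One way to check these answers is with Python's standard `statistics` module, whose `variance` and `stdev` functions use the sample (\(N-1\)) formulas. This is only a sketch; the variable names are chosen here for illustration.

```python
import statistics

# Heights (cm) and weights (kg) from the exercise table.
players = {
    "Connor McDavid": (185.4, 99.0),
    "Auston Matthews": (190.5, 93.0),
    "Sidney Crosby": (180.0, 91.0),
    "Alex Ovechkin": (191.0, 107.9),
}

heights = [h for h, _ in players.values()]
weights = [w for _, w in players.values()]

mean_height = statistics.mean(heights)
var_height = statistics.variance(heights)  # divides by N - 1
sd_height = statistics.stdev(heights)

mean_weight = statistics.mean(weights)
var_weight = statistics.variance(weights)
sd_weight = statistics.stdev(weights)

# e_i for the height model: height_i = mean_height + e_i
errors = {name: h - mean_height for name, (h, _) in players.items()}

print(round(mean_height, 3), round(var_height, 4), round(sd_height, 3))
print(round(mean_weight, 3), round(var_weight, 4), round(sd_weight, 3))
for name, e in errors.items():
    print(name, round(e, 3))
```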