Have you ever wondered how a simple-looking contingency table, with its data organized into rows and columns, can be used to explore the relationship between two categorical variables?
Well, the raw counts in a contingency table don't really help answer the question, because no pattern is readily evident. However, once we convert the counts to proportions (or percentages), a relationship starts to show.
Marginal and Conditional Distribution
The percentages give us the marginal and conditional distributions of our variables. The marginal distribution of a categorical variable shows what is expected in the normal state of affairs, when we don't consider the other categorical variable at all. The conditional distribution shows how the same variable is distributed within a single category of the other variable.
The idea is very simple: we look at the marginal and conditional distributions together to discover whether there is a relationship between our categorical variables.
We need to see if the marginal distribution holds up when we compare it to the conditional distribution of our variable of interest.
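Mathematically, if P(A|B) = P(A) for every category A of one variable and every category B of the other, then there is no relationship: the two variables are independent.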
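To connect this condition to a contingency table, the probabilities can be written in terms of counts. The symbols below are our own notation, not taken from any particular textbook: $n_{AB}$ is the count in the cell for categories A and B, $n_A$ and $n_B$ are the corresponding row and column totals, and $n$ is the grand total.

```latex
P(A \mid B) = \frac{n_{AB}}{n_B},
\qquad
P(A) = \frac{n_A}{n},
\qquad
\text{no relationship} \iff \frac{n_{AB}}{n_B} = \frac{n_A}{n} \ \text{ for all } A, B .
```

In real data the two sides are rarely exactly equal, so in practice we ask whether they are close.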
Let us understand this through an example. We have the Titanic dataset and the question we are asking is:
Is there any association between passenger class and survival rate?
The contingency table of the data is shown below:
|  | First | Second | Third | Grand Total |
| --- | --- | --- | --- | --- |
| Survived | 136 | 87 | 119 | 342 |
| Not Survived | 80 | 97 | 372 | 549 |
| Grand Total | 216 | 184 | 491 | 891 |
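Here our outcome variable is survival, so we need to know how this variable behaves in the normal state of affairs, which is its marginal distribution. The table below shows the marginal and conditional distributions of the outcome variable. Each cell is a percentage of its column total: the class columns give the conditional distribution of survival within each class, and the Grand Total column gives the marginal distribution.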
|  | First | Second | Third | Grand Total |
| --- | --- | --- | --- | --- |
| Survived | 63% | 47% | 24% | 38% |
| Not Survived | 37% | 53% | 76% | 62% |
| Grand Total | 100% | 100% | 100% | 100% |
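As a minimal sketch, the percentages can be derived from the counts with plain Python. The variable names (`counts`, `marginal`, `conditional`) are our own choices for illustration; only the numbers come from the contingency table.

```python
# Counts from the contingency table: outcome -> passenger class -> count.
counts = {
    "Survived": {"First": 136, "Second": 87, "Third": 119},
    "Not Survived": {"First": 80, "Second": 97, "Third": 372},
}
classes = ["First", "Second", "Third"]

# Column totals (passengers per class) and the grand total.
class_totals = {c: sum(row[c] for row in counts.values()) for c in classes}
grand_total = sum(class_totals.values())

# Marginal distribution of the outcome: row total / grand total.
marginal = {o: sum(row.values()) / grand_total for o, row in counts.items()}

# Conditional distribution of the outcome given class: cell / column total.
conditional = {
    o: {c: row[c] / class_totals[c] for c in classes}
    for o, row in counts.items()
}

# Print one row per outcome: the three conditional values, then the marginal.
for outcome in counts:
    cells = " | ".join(f"{conditional[outcome][c]:.0%}" for c in classes)
    print(f"{outcome} | {cells} | {marginal[outcome]:.0%}")
```

Rounded to whole percentages, the printed rows should match the Survived and Not Survived rows of the percentage table.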
The overall survival rate in the Titanic shipwreck is 38% if we do not take class into account. However, the survival rate is 63% for first-class passengers. The conditional rates for second-class (47%) and third-class (24%) passengers also differ from the marginal rate.
We conclude that the conditional probability of survival within each class does not match the marginal probability of survival, which ignores class.
Thus, we can say that there is a strong relationship between survival rate and passenger class.
Summary
In summary, here are the steps needed to determine if there is a relationship between two categorical variables.
- Identify the variable of interest or outcome variable in our contingency table.
- Determine the marginal and conditional distributions of the outcome variable.
- Compare the two distribution probabilities (conditional and marginal).
- Conclude that there is a relationship between the categorical variables if the conditional distributions do not come close to the marginal distribution.
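The steps above can be sketched as one small function. The function name `outcome_gaps` and the layout of the input dictionary are hypothetical choices made for this illustration. It returns, for each group, the conditional proportion of the chosen outcome minus its marginal proportion. How large a gap must be to count as a relationship is left to the reader's judgment here.

```python
def outcome_gaps(counts, outcome):
    """Return conditional-minus-marginal proportion of `outcome` per group.

    `counts` maps each outcome level to a dict of group -> count.
    """
    groups = list(counts[outcome])
    group_totals = {g: sum(row[g] for row in counts.values()) for g in groups}
    grand_total = sum(group_totals.values())

    # Marginal proportion of the outcome, ignoring the grouping variable.
    marginal = sum(counts[outcome].values()) / grand_total

    # Conditional proportion within each group, compared to the marginal.
    return {g: counts[outcome][g] / group_totals[g] - marginal for g in groups}


# Step 1: the outcome variable is survival; the groups are passenger classes.
titanic = {
    "Survived": {"First": 136, "Second": 87, "Third": 119},
    "Not Survived": {"First": 80, "Second": 97, "Third": 372},
}

gaps = outcome_gaps(titanic, "Survived")
for group, gap in gaps.items():
    # Positive gap: the group survived more often than passengers overall.
    print(f"{group}: {gap:+.1%}")
```

Gaps near zero in every group would suggest no relationship. For the Titanic counts, first class sits well above the overall survival rate and third class well below it.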
The chi-square test of independence is a formal statistical way of testing the independence of two categorical variables using the same principle.
Check out this post to see how the chi-square test can be used to test the independence of two categorical variables.
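As a rough sketch of that principle, the Pearson chi-square statistic can be computed by hand for the Titanic counts. It compares each observed count with the count expected under independence (row total × column total / grand total). The variable names are our own. The closed-form p-value used here is valid only because this 2×3 table has exactly 2 degrees of freedom; for other table shapes a statistics library would be needed.

```python
import math

# Observed counts; columns are First, Second, Third class.
observed = [
    [136, 87, 119],  # Survived
    [80, 97, 372],   # Not Survived
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Sum (observed - expected)^2 / expected over all cells.
chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / n  # count under independence
        chi2 += (obs - expected) ** 2 / expected

dof = (len(row_totals) - 1) * (len(col_totals) - 1)
assert dof == 2, "the closed-form p-value below only holds for 2 degrees of freedom"

# With 2 degrees of freedom, the chi-square upper-tail probability is exp(-x / 2).
p_value = math.exp(-chi2 / 2)

print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.3g}")
```

A large statistic with a tiny p-value means the observed counts are far from what independence would predict, which agrees with the comparison of percentages above.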