I am an MBA student taking courses in Statistics.
I attended a seminar hosted by the Statistics faculty at my university in which students presented the research they are doing.
The students had data (e.g. age, gender, race, number of children, which university they studied at, what they studied, neighborhood, city and province they live in, etc.) on different people from across the country, and they were building models to understand the effect of these variables on "university graduation rate" (i.e. in their data, each student either "graduated" or "dropped out").
In my very naïve understanding of statistics, this kind of problem can be solved using a Logistic Regression Model. Provided that the model is fit properly and the estimated coefficients are statistically significant (e.g. judged by their p-values), a Logistic Regression Model should be able to tell you the effect of different combinations of independent variables on "university graduation rate". As I understand it, this is closely related to the Odds Ratio - how much higher or lower are the odds of graduating for "students with children living in city A" compared to "students without children living in city B".
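To make my understanding concrete, here is a minimal sketch (not the students' actual analysis) of a logistic regression with a single binary predictor, fit by Newton-Raphson on simulated data. The variable names (`has_children`, `graduated`) and the "true" coefficients are assumptions I made up for illustration. The odds ratio is the exponential of the fitted slope, and with one binary predictor it should match the odds ratio computed directly from the 2x2 table.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(x, y, iters=25):
    """Newton-Raphson for logit P(y=1) = b0 + b1*x with one covariate x."""
    b0, b1 = 0.0, 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = sigmoid(b0 + b1 * xi)
            w = p * (1.0 - p)
            g0 += yi - p            # gradient w.r.t. b0
            g1 += (yi - p) * xi     # gradient w.r.t. b1
            h00 += w                # entries of X'WX (negative Hessian)
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system (X'WX) * step = gradient
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Simulated data; every number below is an assumption for illustration only.
rng = random.Random(0)
true_b0, true_b1 = 0.5, -0.7
has_children = [1 if rng.random() < 0.4 else 0 for _ in range(4000)]
graduated = [1 if rng.random() < sigmoid(true_b0 + true_b1 * c) else 0
             for c in has_children]

b0, b1 = fit_logistic(has_children, graduated)
odds_ratio = math.exp(b1)  # odds of graduating: with children vs. without

# Cross-check against the odds ratio from the 2x2 table of counts.
pairs = list(zip(has_children, graduated))
n11 = sum(1 for c, g in pairs if c == 1 and g == 1)
n10 = sum(1 for c, g in pairs if c == 1 and g == 0)
n01 = sum(1 for c, g in pairs if c == 0 and g == 1)
n00 = sum(1 for c, g in pairs if c == 0 and g == 0)
table_odds_ratio = (n11 / n10) / (n01 / n00)

print(f"fitted slope b1 = {b1:.3f}, odds ratio = {odds_ratio:.3f}")
print(f"odds ratio from 2x2 table = {table_odds_ratio:.3f}")
```

Note that this sketch treats every student as an independent observation, which is exactly the assumption questioned later in this post.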
However, the students were instead presenting a type of model called a "Multilevel Regression Model" (I think this is also called a "Hierarchical Model").
When I asked the students why they chose a "Multilevel Regression Model" instead of a "Logistic Regression Model", they gave me the following answer:
- It is a reasonable assumption to believe that students in the same faculty within the same university have more similar graduation rates than students in the same university but from different faculties. Likewise, it is a reasonable assumption to believe that students in the same neighborhood have more similar graduation rates to one another than students in different neighborhoods. Apparently a Multilevel Regression Model can address these "within group correlations" whereas a Logistic Regression Model cannot. (Some further comments were made about "Fixed Effects", "Mixed Effects" and "Random Effects" - but I could not fully understand these concepts, nor their relevance when comparing the aptness of Logistic Regression vs. Multilevel Logistic Regression.)
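To see what "within group correlation" might look like in practice, I tried the following small simulation. It is only a sketch under my own assumptions: each hypothetical faculty gets its own random shift `u_j` on the log-odds scale (which, as far as I can tell, is what a "random intercept" refers to), and all students in that faculty share it. The group counts, group size and `sigma_u` are made-up values.

```python
import math
import random
import statistics

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

rng = random.Random(1)
n_groups, group_size = 200, 50  # hypothetical: 200 faculties, 50 students each
sigma_u = 1.0                   # assumed SD of the faculty-level random intercept

group_rates = []
for _ in range(n_groups):
    u_j = rng.gauss(0.0, sigma_u)  # shared by every student in faculty j
    p_j = sigmoid(0.0 + u_j)       # graduation probability within faculty j
    grads = sum(rng.random() < p_j for _ in range(group_size))
    group_rates.append(grads / group_size)

overall = statistics.mean(group_rates)
observed_var = statistics.pvariance(group_rates)

# Variance the faculty-level rates would have if every student were an
# independent draw with the same overall graduation probability.
independent_var = overall * (1.0 - overall) / group_size

variance_ratio = observed_var / independent_var
print(f"observed variance of faculty rates:   {observed_var:.4f}")
print(f"variance expected under independence: {independent_var:.4f}")
print(f"ratio: {variance_ratio:.1f}")
```

A ratio well above 1 means the faculty-level graduation rates vary far more than independent students would produce, i.e. students within a faculty resemble each other. My understanding is that a plain logistic regression ignoring faculty would treat this data as if the ratio were 1.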
While this sounds like a reasonable justification as to why one would prefer a Multilevel Regression Model compared to a Logistic Regression Model - due to my lack of knowledge in Statistics, I have no choice but to blindly accept this justification.
In my opinion, a Logistic Regression Model can provide estimates and help understand the effects of different independent variables on the dependent variable - for example, if neighborhood is included as a variable, all students in the same neighborhood are subject to the same estimated effect of that neighborhood on the university graduation rate. Therefore, in this regard, it seems to me that both the Logistic Regression Model and the Multilevel Regression Model can take into consideration similarities within groups when estimating the effect of independent variables on the dependent variable.
Thus, why might the students from the seminar have chosen a Multilevel Regression Model over a Logistic Regression Model - given that both models can take into consideration the effects of similar groups when estimating the effect of independent variables on the dependent variable?
Some Final Notes:
Even though a Logistic Regression Model can estimate the effect of independent variables on the dependent variable, a Logistic Regression Model assumes that all observations are independent of one another (I.I.D.), whereas a Multilevel Regression Model does not make this assumption (i.e. it allows for correlation within clusters/groups).
If this is really the case - why not just fit a separate model for each combination of factors within the independent variables? I understand that this might result in fitting a large number of models, but aren't modern computers powerful enough to do this?
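To test my own suggestion, I put together the rough sketch below. It compares "one estimate per group" (the raw graduation rate of each hypothetical neighborhood, estimated from that neighborhood alone) with estimates that are pulled toward the overall mean. The shrinkage used here is a simple method-of-moments linear shrinkage that I am using as a crude stand-in for what a multilevel model does; it is not an actual multilevel fit, and all sizes and distributions are assumptions.

```python
import math
import random
import statistics

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

rng = random.Random(2)
n_groups, group_size = 300, 5  # hypothetical: many neighborhoods, few students each

true_rates, raw_rates = [], []
for _ in range(n_groups):
    p_j = sigmoid(rng.gauss(0.0, 1.0))  # assumed true rate of neighborhood j
    grads = sum(rng.random() < p_j for _ in range(group_size))
    true_rates.append(p_j)
    raw_rates.append(grads / group_size)  # "separate model per group" estimate

overall = statistics.mean(raw_rates)
total_var = statistics.pvariance(raw_rates)
# Average sampling variance of a group rate (unbiased estimate of p(1-p)/m).
within_var = statistics.mean([r * (1 - r) / (group_size - 1) for r in raw_rates])
between_var = max(total_var - within_var, 0.0)

# Share of the spread in raw rates attributed to real group differences.
shrink = between_var / total_var
pooled_rates = [overall + shrink * (r - overall) for r in raw_rates]

def mse(estimates):
    return statistics.mean([(e - t) ** 2 for e, t in zip(estimates, true_rates)])

mse_separate = mse(raw_rates)
mse_pooled = mse(pooled_rates)
print(f"shrinkage factor:           {shrink:.2f}")
print(f"MSE, separate per group:    {mse_separate:.4f}")
print(f"MSE, shrunk toward overall: {mse_pooled:.4f}")
```

If I have set this up correctly, the per-group estimates come out with the larger error when groups are small, which would suggest that the limitation of fitting many separate models is the amount of data per group rather than computing power. Is this the right intuition for why a Multilevel Regression Model is preferred?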