
Background
I want to compare the performance of a chatbot (e.g. ChatGPT) on medical questions with that of a group of human doctors. Each question will be answered and the answer scored on an ordinal scale: 1, 2, 3, 4, or 5. I want to test for a significant difference in scores between these two groups (chatbot vs. doctors).

Problems
The fact that the score is ordinal means I can't use parametric tests and ideally shouldn't take means of the scores. Also, each question is potentially of a different difficulty level, so I don't think it would be valid to treat the measurements as independent. On top of that, each doctor may be of a different skill level, so the scores depend on which doctor answered.

Solutions considered so far

Averaging
I have seen some studies simply take the average of the scores for each question in the human group. This doesn't seem entirely valid to me: an ordinal scale ideally shouldn't be averaged. It also seems we would lose a lot of information about the confidence in our result. For example, scores of (1, 1, 5, 5) and (3, 3, 3, 3) have the same mean but describe very different answer quality, as the sketch below illustrates.
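As a toy illustration of the information loss (all numbers made up), two very different score distributions collapse to the same mean:

```python
import numpy as np

# Two hypothetical sets of doctor scores for the same question: the means
# are identical, but the distributions tell very different stories.
polarised = np.array([1, 1, 5, 5])
unanimous = np.array([3, 3, 3, 3])
print(polarised.mean(), unanimous.mean())  # 3.0 3.0
```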

Mixed effects model
Given all the above, I came to the conclusion that I should use a mixed-effects ordinal logistic regression, also known as a cumulative link mixed model (CLMM). I could then use the score as the dependent variable, the question ID and participant ID as random effects, and the group ("chatbot" or "human") as a fixed effect. I'm now trying this method out on a fake dataset to work out the sample size needed.
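For the fake-data step, here is a minimal simulation sketch under a latent-variable (cumulative logit) formulation of the CLMM; all parameter values (the group effect, the random-effect SDs, the cutpoints) are illustrative assumptions, not estimates:

```python
import numpy as np

rng = np.random.default_rng(0)

n_questions, n_doctors = 30, 10
beta_chatbot = 0.5                   # assumed fixed effect of group "chatbot"
sd_question, sd_doctor = 1.0, 0.5    # assumed random-effect standard deviations
cutpoints = np.array([-2.0, -0.7, 0.7, 2.0])  # latent thresholds for scores 1..5

u_question = rng.normal(0.0, sd_question, n_questions)  # question effects
u_doctor = rng.normal(0.0, sd_doctor, n_doctors)        # doctor effects

def to_score(latent):
    # map a latent value to an ordinal score 1..5 via the cutpoints
    return 1 + int(np.searchsorted(cutpoints, latent))

rows = []  # (group, question_id, participant_id, score)
for q in range(n_questions):
    for d in range(n_doctors):
        eta = u_question[q] + u_doctor[d] + rng.logistic()
        rows.append(("human", q, d, to_score(eta)))
    # the chatbot answers every question, with no participant random effect
    eta = beta_chatbot + u_question[q] + rng.logistic()
    rows.append(("chatbot", q, "bot", to_score(eta)))
```

A data frame built from `rows` can then be passed to a CLMM fitter (e.g. `clmm` from R's `ordinal` package), and repeating the simulation while counting how often the group effect comes out significant gives a crude power estimate for a given sample size.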

It kind of works, but there is a problem: in group "human" I have several doctors, while in group "chatbot" I have only a single chatbot. How should that be modeled? I see three options (sketched in code after the list):

  • If I assign the same participant ID to all scores in group "chatbot", the result almost never becomes statistically significant, no matter how many questions I use or how many participants I add to group "human".
  • If I treat each score in group "chatbot" as coming from a separate individual, I very easily get statistically significant results, but I suspect this skews the estimated between-individual variance, invalidating the test.
  • I could divide the answers in group "chatbot" among a number of fictional individuals equal to the number of doctors in group "human", but that would just be an arbitrary compromise between the two options above.
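In terms of data coding, the three options differ only in the participant IDs attached to the chatbot's rows; a small sketch with made-up counts:

```python
import numpy as np

n_chatbot_answers, n_doctors = 30, 10

# Option 1: one shared ID for every chatbot answer
ids_single = np.zeros(n_chatbot_answers, dtype=int)

# Option 2: a fresh ID for each chatbot answer
ids_per_answer = np.arange(n_chatbot_answers)

# Option 3: spread the answers over as many fictional IDs as there are doctors
ids_fictional = np.arange(n_chatbot_answers) % n_doctors
```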

The problem is that I know there is no participant random effect in group "chatbot" (all its answers come from the same "individual", the chatbot), but the model can't take that into account while simultaneously accounting for the individual differences between the doctors.

Question
I'm now looking for a different statistical method, or a way to adapt the mixed-effects model so that it works for my case, but I don't know how. I would appreciate any ideas!


1 Answer


Item response theory is designed to answer this sort of question. It's been a long time since I studied it (like, 25 years), so I don't want to give details that are probably outdated by now. But, back then at least, it required quite a lot of data.

It may be worth looking into, anyway.
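For orientation, here is a minimal sketch of the category probabilities under one common IRT model for ordinal scores, the graded response model: every respondent (each doctor, and the chatbot as a single respondent) gets an ability parameter `theta`, and each question gets a discrimination `a` and ordered thresholds `b`. The parameter values below are purely illustrative:

```python
import numpy as np

def grm_category_probs(theta, a, b):
    """Graded response model: probabilities of scores 1..5.

    theta : respondent ability (scalar)
    a     : item discrimination (scalar, > 0)
    b     : ordered thresholds, one per score boundary (length 4 for scores 1..5)
    """
    # P(score >= k) for each boundary k = 2..5
    p_ge = 1.0 / (1.0 + np.exp(-a * (theta - np.asarray(b, dtype=float))))
    upper = np.concatenate(([1.0], p_ge))  # P(score >= 1) is always 1
    lower = np.concatenate((p_ge, [0.0]))  # P(score >= 6) is always 0
    return upper - lower                   # probabilities of scores 1..5, sums to 1

# Example: an average-ability respondent on a moderately discriminating item
print(grm_category_probs(theta=0.0, a=1.2, b=[-1.5, -0.5, 0.5, 1.5]))
```

Because ability is a per-respondent parameter, the one-chatbot-vs-many-doctors asymmetry is handled naturally: the chatbot contributes a single `theta` that can be compared with the distribution of the doctors' abilities.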

  • Thanks! I'll look that up at once and see if I can wrap my mind around it!
    – Ylor
    Commented Oct 30, 2023 at 13:22
  • After reading up on it, it seems you are absolutely right! Item response theory seems to be just what I need.
    – Ylor
    Commented Oct 30, 2023 at 18:33
