How often does ChatGPT give an incorrect answer to an SE question?

Question

This is a follow-up of Ban ChatGPT network-wide. How often does ChatGPT give an incorrect answer?

ChatGPT loves the phrase "I hope this helps! Let me know if you have any other questions", searching for it on SO may be insightful. There are 9 results before December 2022, and 34 in December 2022, when ChatGPT became publicly available. — smitop, Commented Dec 5, 2022 at 16:22
Very often. It cannot know the correct answer. It just generates text that conforms to what an answer might look like. There was one user who posted two answers contradicting each other. And they were both still wrong. I've seen answers that contradict themselves even in the same paragraph. — VLAZ, Commented Dec 5, 2022 at 16:28
Not SE questions, but this thread passed by my Twitter timeline and is a good illustration of "wrong answers": twitter.com/studentactivism/status/… — Tinkeringbell, Commented Dec 5, 2022 at 16:54
@Tinkeringbell also see here twitter.com/vogon/status/1598334517647134720 and here is a paper on common problems with this approach to text generation: dl.acm.org/doi/10.1145/3442188.3445922 — VLAZ, Commented Dec 5, 2022 at 17:26
@Tink My favourite quote from that Twitter thread is "Because if ChatGPT is, as it seems to be, a consummate bullshitter, it's also—definitionally—a bullshitter who doesn't know when its bullshitting. And we all know that that's the most dangerous kind." — PM 2Ring, Commented Dec 5, 2022 at 17:27
This is unanswerable as written. Actually having an answer to this would require information which isn't available to anyone, because the complete answer would include every attempt that's ever been made of putting a question into ChatGPT. You'd need the logs for all input to ChatGPT (matched to SE questions) and all of its output. You'd then need to have humans evaluate every single ChatGPT response for accuracy. You could do a sampling using randomly selected questions, with each answer human evaluated for accuracy, but there's likely to be substantial variation across SE sites. — Makyen, Commented Dec 10, 2022 at 19:25
Even the question you probably intended to ask: "Of all ChatGPT answers posted to SE sites, what percentage are incorrect/correct?" is unanswerable, assuming some semi-reasonable limit on the amount of human effort (note that posted answers are already a filtered subset of those created; presumably at least slightly filtered for accuracy by those posting). To estimate this, you'd have to look at a substantial number of randomly selected ChatGPT answers (including selecting from all deleted ones on all sites) and individually evaluate each for accuracy by humans. That's a huge amount of work. — Makyen, Commented Dec 10, 2022 at 19:26
@Makyen human evaluation on some prediction subset is something commonly done in my field (which turns out to include ChatGPT). 100 samples, 3 humans, 1h, done. The downvoters have no clue about NLP/AI research. — Franck Dernoncourt, Commented Dec 10, 2022 at 19:38
I find it unlikely that you'd find 3 humans who, together, are experts on every subject covered on all SE sites. Yes, I know that you don't really mean "use 3 people", but I think your example underestimates the scope. If you don't have experts on all subjects being evaluated, then you don't really know if the answer is correct. These answers do fool people, particularly when the person isn't an expert and/or when not read critically. Eliminating possible Q&A because you don't have an expert for it introduces bias, which doesn't mean it wouldn't be data, just not what you asked for. — Makyen, Commented Dec 10, 2022 at 19:46
@Makyen subsample domains, analyze variance of the prediction quality across domains, etc. — Franck Dernoncourt, Commented Dec 11, 2022 at 4:19
Is that why this was closed, because "this is unanswerable as written"? A question should not be judged by its answers, though, and "this is unanswerable" is a perfectly fine answer to a perfectly fine question (I am not sure this Q is of the latter kind). — Joachim, Commented Dec 12, 2022 at 15:38
I don't understand some comments. Surely the asker would be happy with an empirical estimation, sampling a certain number of questions and then trying to find some reasonable confidence intervals. It's an interesting question. Even with the current limitations, I would say that a correct answer rate of say higher than 10-20% would be impressive. I just wonder, what ballpark are we currently in? — NoDataDumpNoContribution, Commented May 9, 2023 at 19:58
@Trilarion Here's an evaluation arxiv.org/abs/2304.05613 but it's not specific to SE. I had started some evaluation on SE but the question got closed, so I stopped it. Some commenters and close voters clearly have no evaluation experience. — Franck Dernoncourt, Commented May 9, 2023 at 20:11
All really crap answers below. I would be very surprised if performed blind ChatGPT4 answers would now not score at least as well as random Stack Overflow answers and perhaps even as good or better than the answers indicated as the current best answers. Cf also the examples I gave here meta.stackexchange.com/questions/387575/…. Especially for programming. This needs to be properly benchmarked, done blind, not by some opinionated moderator but as a research project... — Tom Wenseleers, Commented Jun 18, 2023 at 10:05

Glorfindel · Accepted Answer · 2022-12-05 16:29:08Z

As a language model, ChatGPT is not designed to provide answers to specific questions, especially those related to a specific topic or subject. Instead, it uses a large corpus of text to generate responses based on the input it receives. This means that the responses it generates may not always be accurate or relevant to the specific question being asked. Additionally, ChatGPT does not have access to external information, such as the internet, so it cannot provide answers to questions that require knowledge beyond what it has been trained on. In short, ChatGPT is not intended to be used as a source of information, and it is not able to provide accurate answers to all questions.

Emphasis mine - you may guess what the source of that text is.

I don't think it really matters how often it's correct or not; the main question right now is if the community can handle the times it's not. Automation is allowed on Stack Exchange if it performs significantly better than humans, but I trust the Stack Overflow community if they say it's beyond control. The score on this question speaks volumes.

Considering that Bing now gives access to the wider internet, it seems that this has to be asked again. — Braiam, Commented Mar 16, 2023 at 15:41

Robert Longson · Accepted Answer · 2022-12-05 17:26:08Z

10

From memory when looking through the chatGPT answers here when they existed. I think that was around 10-15 answers before they were deleted.

Somewhere between a third and a half had comments indicating that they were problematical. I.e. the OP or a SME had commented on them raising issues. It's possible many more were wrong, but nobody had looked at them closely yet.
One or two were accepted. That doesn't mean they were correct, merely that they were helpful to the question asker.
One was downvoted. Of course downvoting costs rep and you need 125 rep to be able to do it at all so most question askers can't downvote.

The trouble is ChatGPT is prepared to make up an answer on any subject under the sun and so we'd likely need as many SMEs as answers to check them all properly. Neither "I don't know", or "I'm not sure" seem to be in its range of possible answers.

edited Dec 5, 2022 at 17:26

answered Dec 5, 2022 at 17:05

Robert Longson

35.6k13 gold badges82 silver badges163 bronze badges

18

Unfortunately, accepting doesn't even mean it was actually helpful to the question OP. It only means the question OP clicked the button. I recall one of these ChatGPT answers which I deleted, where the question OP had awarded a bounty to it, accepted it, and then left a comment to the effect of "this doesn't work". What I assume happened is that the question OP accepted and awarded the bounty because it looked good, but hadn't actually tried what the answer was saying was the solution, which is a good example of the primary problem with these ChatGPT generated answers: bad answers look good.
– Makyen
Commented Dec 5, 2022 at 19:05
@Makyen so, basically if we get a human that makes "looking good" but factually incorrect answers, then that's ok? I guess we are so tuned to look for signs of "bad quality" that we forgot the main one: being incorrect.
– Braiam
Commented Mar 16, 2023 at 15:43
3

No, @Braiam, incorrect answers by humans aren't desired. However, human generated poor answers tend to get downvoted. The difference is A) volume of posts and ease of creation, and B) human generated poor answers tend to be more easily identified as poor quality, because it takes most humans quite a bit of work to compose an answer that has good grammar, sounds confident, and presents in a manner that "looks good". AI generated content takes seconds, at most, and tends to be "eloquent bullshit", which looks and sounds good, even when hilariously wrong and self-contradictory.
– Makyen
Commented Mar 16, 2023 at 16:37
As to (A), the ease of creation and speed of AI vs human generation of content means that there are orders of magnitude more posts generated by the AI than a human doing the same. There are orders of magnitude more people posting them, often because those people think they are helping, when they are actually hurting by posting content they have no clue as to if it's correct, or not, which causes the person asking even more confusion.
– Makyen
Commented Mar 16, 2023 at 16:37
2

As to (B), the fact that the AI generated text is "eloquent bullshit" results in confusing a lot of people into thinking it's correct when it's not. It also tends to mean that evaluating it for correctness tends to require a higher level of subject matter expertise and, again, orders of magnitude more more time.
– Makyen
Commented Mar 16, 2023 at 16:37
2

Overall, these combine to mean that it requires orders of magnitude more curation resources to evaluate the aggregate volume of AI generated content on an "is this correct" basis. The sites, and SO in particular, are substantially below the level of curation resources they need to actually curate the already existing volume of human generated content. Adding the orders of magnitude additional curators needed to properly evaluate AI generated content, particularly individuals with a high level of subject matter expertise, over all subject areas, is something that's just not possible.
– Makyen
Commented Mar 16, 2023 at 16:38
1

Until an AI has inherent checks for correctness, it's just non-viable to allow the content. If it was only posted by people capable of checking for the correctness of that post, and who actually do so, it would be a different issue, but 99.9% (quite close to an accurate percentage) of the posted content isn't that. Unfortunately, even strong advocates for AI generation will routinely not care about manually checking for accuracy and just lie about doing so. We just aren't able to weed out that 99.9% in order to allow the reasonable 0.1%.
– Makyen
Commented Mar 16, 2023 at 16:38
Does AI get things right? Sometimes. Can it be helpful to some people? Sometimes. Can it be close enough or just feedback from something else allowing the human using it to break out of their thought patterns and get to an answer? Yes, sometimes. Is it often sufficiently wrong to cause the person using it to be even more confused or head in an even worse direction? Definitely. Currently, AI generation capability has absolutely no ability to even attempt to judge correctness. Doing so is just not part of the design. Until it is, AI generation will continue to be just "eloquent bullshit".
– Makyen
Commented Mar 16, 2023 at 16:38
1

@Braiam You're literally months late joining the discussion on AI generated content. While your position and opinion are valued, you're rehashing arguments that have been made many times over the last few months without appearing to bring anything new. If you're interested in the topic, please take the time to go through the existing discussions and move forward from them, rather than rehashing things that have already been covered in detail multiple times.
– Makyen
Commented Mar 16, 2023 at 16:38
@Makyen "you're rehashing arguments that have been made many times over the last few months without appearing to bring anything new", except that ChatGPT is new. As said elsewhere is a tool in constant development, now with access to the wider internet (Bing) and that funny incorrect answers are being corrected (ask what's the fastest marine bird, it's finally not the peregrine falcon). It's not that I bring something new, it's that we need to reassess our previous understanding about the tool. Like an answer on SE, it's not guaranteed to be correct from now to the heat death of the universe.
– Braiam
Commented Mar 16, 2023 at 19:24
If we are going to be so obtuse to not being able to reassess our previous point when the context changes, then why the heck we discuss things at all? Opinions need to be able to change @Makyen, if they do not, then I prefer not having them at all.
– Braiam
Commented Mar 16, 2023 at 19:25

Add a comment |

Scortchi - Reinstate Monica · Accepted Answer · 2023-06-22 08:51:52Z

Not a systematic study: but of 20 I read carefully (purported answers posted to Cross Validated, or responses to actual C.V. questions input to ChatGPT by me)^†; I judged one to be useful, though not as useful as the relevant section of the obvious Wikipedia article, & about half of the rest to be not merely useless but positively misleading.

Extrapolating to other sites would be unwise. For example, I found ChatGPT to be terribly bad at answering questions with much mathematical content, so it wouldn't be surprising if useful answers from it to questions on Mathematics, Physics, &c. were vanishingly rare.

† I used the free version of ChatGPT available in Nov./Dec. 2022 (v.3?) & did not attempt any 'prompt engineering'; I'm pretty sure the users who posted the purported answers were following a similar process.

Franck Dernoncourt · Accepted Answer · 2023-08-09 16:21:59Z

4

Some findings from the paper Who Answers It Better? An In-Depth Analysis of ChatGPT and Stack Overflow Answers to Software Engineering Questions. Samia Kabir, David N. Udo-Imeh, Bonan Kou, Tianyi Zhang. arXiv:2308.02312:

answered Aug 9, 2023 at 16:21

Franck Dernoncourt

41.3k5 gold badges60 silver badges163 bronze badges

Add a comment |

Rebecca J. Stones · Accepted Answer · 2023-05-16 03:52:10Z

I copy/pasted 10 random questions from the Chinese Stack Exchange into https://poe.com/Sage which is described as:

Good in languages other than English, and also in programming-related tasks. Powered by gpt-3.5-turbo.

I give its answers a percentage value for how good I think the answer is (100% = about as good as an answer can get).

Question	GPT 3.5 turbo, answer quality (%)
Difference between 必须得 and 一定要？	100%, significantly better than the current answers
的 in the “你们是坐飞机来北京的吗？”	20%, it got totally confused
what is the meaning of 都不甘于平庸 in the song 致你 by 苡慧?	70%, it struggled with song lyrics (it's hard for me to judge this one---I don't know myself)
Function of 了 Completion of Action or Change of State	90%, there were small imperfections if I'm being nit-picky
Can 人口 be interpreted as a number? [my question]	100%, clearly understood the question and my difficulties
What's the function/meaning of 就不要 in 过去的事情就不要说了?	100%, clearly understood the question, and concisely answered correctly, better than two (of three) human answers
How to say "a quarter past nine" in Chinese?	100%, easy question
What is the meaning of 教训 here？	100%, with an explanation significantly better than the current answers
Is it possible for 已 to make a sentence into the future tense?	80%, the explanation is a bit weird in places
What is the colloquial meaning of 小祖宗？	100%, it's a good answer, but so is the top human-submitted answer

The answers it's writing are friendly, concise, almost always immaculately grammatical, and don't omit answering half the question (e.g., when someone asks a general question and gives examples). Two of the answers it wrote are substantially better than the current human-written answers. (What do I do with them?!) The first question contained a typo (which I left in---the question says 你必须的有咖啡? , but it should be 你必须得有咖啡? ), and the AI wasn't confused.

So in answer to the question:

How often does ChatGPT give an incorrect answer?

I'm not sure how GPT 3.5 Turbo and ChatGPT are different, but from the above experiment, GPT 3.5 Turbo was incorrect only once, so 10% of the time.

Update: I've been copy/pasting a lot of Chinese.SE questions into Sage (see: Examples of an AI (Sage) answering questions at Chinese.SE [reference post]). What I'm now understanding is that there's selection bias in my process...

Of the AI-generated answers I'm able to judge as (un)reasonable, almost all are reasonable. Much like the above, some have imperfections, some are great answers, and every now and then its answers are total rubbish. However, for more advanced Chinese questions, perhaps AI-generated answers are more likely to be nonsensical, and I'm unable to evaluate them.

The AI is really good at questions like "Is there a Chinese idiom meaning [something]?" and "What's the difference between [x] and [y]?" It's mediocre at questions like "How did the character [z] evolve?" I'm guessing it's poor at humor in Chinese, since it's poor at humor in English.

TL;DR: ChatGPT is capable of generating reasonable, and even excellent, answers to a subset of Chinese.SE questions (albeit with occasional nonsense). It's possible outside of this subset, ChatGPT produces nonsense more frequently.

Thanks! "I'm not sure how GPT 3.5 Turbo and ChatGPT are different," -> same model. GPT is indeed pretty good at language questions (I tested it a bit on French). — Franck Dernoncourt, Commented Apr 13, 2023 at 5:34
I've found it very useful in generating example sentences using Spanish words for my anki deck. — SurpriseDog, Commented May 31, 2023 at 15:37
If curious how about why LLMs are good at translating: Briakou, Eleftheria, Colin Cherry, and George Foster. "Searching for Needles in a Haystack: On the Role of Incidental Bilingualism in PaLM's Translation Capability." arXiv preprint arXiv:2305.10266 (2023). arxiv.org/abs/2305.10266 — Franck Dernoncourt, Commented Jun 18, 2023 at 10:14

Stack Exchange Network

How often does ChatGPT give an incorrect answer to an SE question?

5 Answers 5

You must log in to answer this question.

Not the answer you're looking for? Browse other questions tagged
discussion
answers
chatgpt
.

Linked

Hot Network Questions

How often does ChatGPT give an incorrect answer to an SE question?

5 Answers 5

You must log in to answer this question.

Not the answer you're looking for? Browse other questions tagged discussionanswerschatgpt.

Linked

Related

Hot Network Questions

Not the answer you're looking for? Browse other questions tagged
discussion
answers
chatgpt
.