What question can I ask ChatGPT, right now, that will reliably produce a factually incorrect, wrong, or false answer?

LoveRainbow@lemmy.world · edit-2 24 minutes ago

By definition that would be a rather difficult thing to falsify. I tend to find the answers in my own academic fields (philosophy and psychology) to be accurate and impressive.

Admittedly that rarely requires asking for direct passages from texts: usually more the refinement of basic principles, application of such principles, or creating new theories by combining previous ones.

When I ask for philosophical quotations, it hasn’t produced an error I’ve observed yet: presumably it often combs the many quotation websites for such things.

Can you give me an example of a question I can ask it where it will misquote and/or rewrite something?

I am not an expert in Magic The Gathering but perhaps I can somehow see what you mean.

LoveRainbow@lemmy.world · 28 minutes ago

The post did not state that men no longer had penises and were incapable of inseminating others.

LoveRainbow@lemmy.world · 31 minutes ago

In any case: the latest model gets the answer right 🤷🏻‍♂️

LoveRainbow@lemmy.world · 34 minutes ago

But…that’s what it (eventually) says - although it seemed to struggle to get there and phrased it weirdly.

LoveRainbow@lemmy.world · edit-2 36 minutes ago

The article states: “ChatGPT-4o performed best with 84.6% validity”

It is reasonable to assume that GPT 5.5 in thinking mode has significantly reduced the error rate.

It is also worth noting that the error rate when it comes to diagnosis amongst real doctors is estimated to be around 5%

Admittedly a quite old study: Singh, H., Meyer, A. N. D., & Thomas, E. J. (2014). The frequency of diagnostic errors in outpatient care: Estimations from three large observational studies involving US adult populations. BMJ Quality & Safety, 23(9), 727–731. https://doi.org/10.1136/bmjqs-2013-002627

In response to your point: I am mainly interested in probabilistic reliability - if it gives the correct answer 99.9% of the time, it is clearly superior to the vast majority of human beings (with, perhaps, the exception of the best specialists in the most obscure niches) - especially given the sheer breadth of topics it can reliably answer questions on.

Interestingly, my question “What was India like before the British arrived?” produces consistently biased and misleading answers, though I haven’t yet put it to the new model.

LoveRainbow@lemmy.world · 47 minutes ago

“strawberry” has 3 r’s.

GPT 5.5, Thinking Mode
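For anyone who wants to check the letter count independently of any chatbot, here is a minimal sketch in Python. It only uses the built-in `str.count` method; the variable names are my own.

```python
# Count how many times the letter "r" appears in "strawberry".
word = "strawberry"
r_count = word.count("r")

print(f'"{word}" contains {r_count} occurrences of "r"')
```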

LoveRainbow@lemmy.world · 49 minutes ago

I asked it: paraphrase a poem about glass testicles by Carol Anne Duffy.

My assumption was that it would interpret the request as asking it to imagine such a poem and then paraphrase it. Perhaps replacing “a” with “the” would pressure it into producing a paraphrase of a non-existent poem.

The response is interesting, but not necessarily misleading or factually incorrect:

"I can’t paraphrase a full Carol Ann Duffy poem, because that would be a close transformation of copyrighted poetry. But I can help in one of these ways:

Give me the title, and I can provide a brief summary of what the poem is saying, its themes, imagery, tone, and meaning.

Based on your description, the poem may be using “glass testicles” as a deliberately strange, fragile, comic-grotesque image: masculinity presented as precious, exposed, breakable, perhaps ridiculous. In Duffy’s style, that sort of image would likely undercut male power by making it seem vulnerable, theatrical, or absurd, while also drawing attention to the body as something symbolic rather than merely physical.

A safe paraphrase of the idea, rather than the poem, would be:

The poem presents masculinity as something fragile and strangely ornamental. What might normally be associated with strength or virility is imagined as delicate, transparent, and easily damaged. The effect is comic, unsettling, and critical, making male pride seem both exposed and vulnerable.

If you paste a short excerpt of up to about 90 characters, I can explain that specific bit closely."

LoveRainbow@lemmy.world · 54 minutes ago

That’s a good one, it seems to answer in a bizarre way - but it does, ultimately, get to the right answer. Copying your question word for word, I get this:

Yes. Several NFL team names do not end in “s”:

Chicago Bears? no, ends in “s”.

The main ones that do not are:

Miami Heat? That’s NBA, not NFL.

Utah Jazz? Also NBA.

For the NFL specifically, the current team names that do not end in “s” are:

Miami Dolphins? ends in “s”.

New England Patriots? ends in “s”.

Actually, the clear NFL examples are:

Buffalo Bills? ends in “s”.

Tennessee Titans? ends in “s”.

Correction: in the current NFL, the team whose name does not end in “s” is:

Washington Commanders? ends in “s”.

So: no, as of the current NFL naming, all 32 team nicknames end in “s” except none.

A common trick answer used to be the Washington Football Team, which did not end in “s”, but that name was used temporarily from 2020 to 2021. They became the Washington Commanders in 2022.
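The final claim in that answer can be checked mechanically. Below is a minimal sketch, assuming the list of 32 nicknames I have typed in from memory is the current one (it is not pulled from any official source, so treat the list itself as an assumption):

```python
# Current NFL team nicknames, entered by hand (assumed, not fetched from an official source).
nicknames = [
    "Cardinals", "Falcons", "Ravens", "Bills", "Panthers", "Bears",
    "Bengals", "Browns", "Cowboys", "Broncos", "Lions", "Packers",
    "Texans", "Colts", "Jaguars", "Chiefs", "Raiders", "Chargers",
    "Rams", "Dolphins", "Vikings", "Patriots", "Saints", "Giants",
    "Jets", "Eagles", "Steelers", "49ers", "Seahawks", "Buccaneers",
    "Titans", "Commanders",
]

# Collect any nickname that does not end in "s" (case-insensitive).
not_ending_in_s = [name for name in nicknames if not name.lower().endswith("s")]

print(f"Teams checked: {len(nicknames)}")
print(f"Nicknames not ending in 's': {not_ending_in_s}")
```

If the list is right, the second line prints an empty list, which matches the conclusion ChatGPT eventually reached.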

LoveRainbow@lemmy.world · 1 hour ago

Doesn’t seem particularly misleading or factually incorrect.

LoveRainbow@lemmy.world · 4 hours ago

Also…reading the details…for GPT they used the antiquated 3.5 model from 2022 instead of the latest ones (I mean…wtf?), they evaluate responses in a vague manner with the ambiguous word “problematic” and some of the criticisms are pedantic and trivial: e.g. “all the chatbots wrote at a “difficult” reading level equivalent to college students, which reduces readability for the general public.”

Boohoo.

I suspect that the current GPT 5.5 model, on thinking mode, would make short work of the questions mentioned in that research article.

LoveRainbow@lemmy.world · 4 hours ago

Very interesting, thanks.

Although some of those questions are difficult to answer in a way that would not be deemed “problematic” by someone, I suppose, like “Will women ever beat men in an elite marathon?”

However it seems to give a pretty good answer at the moment:

In a normal elite marathon, the best women are very unlikely to beat the best men outright. The current gap is still large: the men’s marathon world record is listed at 1:59:30 by Sabastian Sawe, while the women’s mixed-race world record is 2:09:56 by Ruth Chepngetich, and the women-only record is 2:15:41 by Tigst Assefa. That puts the best women roughly 8–13% behind the very best men, depending on which women’s category you compare.

So the answer depends on what you mean by “beat men”.

A top woman beating some elite men in the same marathon: yes, absolutely possible. In any given major race, some high-level male runners blow up, get the pacing wrong, have stomach problems, cramp, or simply have a bad day. A woman running 2:10–2:15 can beat many very serious male marathoners.

A woman beating the best male marathoners in the world, in the same race, under normal marathon conditions: probably no, unless something radically changes. The male advantage in oxygen-carrying capacity, haemoglobin, muscle mass, heart size, absolute power output, and running economy remains substantial at the elite marathon distance. Reviews of endurance running generally find that men outperform women by around 7–14% from sprints through the marathon, though the gap can narrow in ultra-endurance events.

The strongest argument for “maybe someday” is that women’s marathon history is younger. Women were excluded from serious distance-running opportunities for a long time, and the Olympic women’s marathon only began in 1984. Better funding, deeper talent pools, improved shoes, pacemaking, nutrition, altitude training, and professional support have helped women’s times fall dramatically. The women’s record has moved from over 2:20 in the late 1990s to around 2:10 now.

But the gap has not been closing towards zero. Men’s performances have also improved, especially with modern super-shoes and optimised pacing. Shoe studies suggest modern racing shoes can improve marathon times by meaningful margins for both men and women, so technology does not obviously eliminate the sex gap.

Where women may beat men more plausibly is beyond the marathon. In very long ultramarathons, especially events lasting many hours or days, the male advantage can shrink and occasionally reverse. Women may have advantages in fatigue resistance, fat metabolism, pacing discipline, and resilience over very long durations, although researchers warn that ultra-distance comparisons are often complicated by smaller female participation and uneven competition depth.

My best estimate: in the elite standard marathon, women will keep getting faster, and a woman may eventually run close to or under 2:05. But beating the best men outright over 26.2 miles looks biologically improbable rather than merely historically delayed. In ultramarathons, the answer is much more open, and in some extreme-distance contexts women already can beat men.
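The “roughly 8–13%” figure in that answer can be sanity-checked with simple arithmetic. This is a minimal sketch that takes the three record times exactly as quoted in the answer (I have not verified them independently); the helper names `to_seconds` and `percent_gap` are my own.

```python
def to_seconds(hms: str) -> int:
    """Convert an 'H:MM:SS' string to a total number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def percent_gap(slower: str, faster: str) -> float:
    """How much longer the slower time is, as a percentage of the faster time."""
    fast = to_seconds(faster)
    return (to_seconds(slower) - fast) / fast * 100


# Times as quoted in the ChatGPT answer (assumed, not independently verified).
mens_record = "1:59:30"
womens_mixed_record = "2:09:56"
womens_only_record = "2:15:41"

gap_mixed = percent_gap(womens_mixed_record, mens_record)
gap_women_only = percent_gap(womens_only_record, mens_record)

print(f"Mixed-race record gap: {gap_mixed:.1f}%")
print(f"Women-only record gap: {gap_women_only:.1f}%")
```

On those quoted times the gaps come out at about 8.7% and 13.5%, so the stated range is in the right ballpark, if slightly rounded down at the upper end.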

LoveRainbow@lemmy.world · 4 hours ago

To clarify: do you mean it produced an answer similar to the one I got (a fairly agreeable and correct answer, in my opinion)

or

a censored and misleading answer?

If the latter, could you please copy and paste the text so I can read it? Genuinely curious.

LoveRainbow@lemmy.world · 4 hours ago

I asked: Explain the difference in reliability and accuracy when it comes to GPT5-mini (on DuckDuckGo) Vs GPT5.5 on thinking mode.

Response:

"GPT-5-mini on DuckDuckGo is likely to be much less reliable than GPT-5.5 in thinking mode for difficult questions, but it may be perfectly adequate for quick, low-stakes tasks.

The key distinction is model size and reasoning depth. OpenAI describes GPT-5 mini as a faster, more cost-efficient version of GPT-5, aimed at well-defined, low-latency, high-volume tasks. That usually means it is optimised for speed and cost, not for the most careful reasoning. GPT-5.5, by contrast, is presented by OpenAI as a stronger reasoning model with better persistence, stronger tool use, and improved reliability on complex work.

For simple tasks, the gap may be small. For example, rewriting a paragraph, summarising a short text, generating a recipe, explaining a familiar concept, or answering a basic factual question will often be handled well by GPT-5-mini. It may even feel preferable because it is faster and less verbose. DuckDuckGo’s own help page says Duck.ai offers GPT-5 mini for free alongside other models, with anonymous access through Duck.ai.

For complex reasoning, the gap is much larger. GPT-5.5 thinking mode should be better when the task involves multi-step reasoning, evaluating evidence, spotting contradictions, doing careful moral/philosophical analysis, debugging code, comparing options, or avoiding subtle factual errors. Thinking mode matters because it allows more internal reasoning effort before answering, whereas mini models are usually designed to produce a good-enough answer quickly.

Accuracy is also affected by the interface. DuckDuckGo’s AI chat gives private access to third-party models, but the exact behaviour depends on DuckDuckGo’s implementation: context limits, system prompts, retrieval/search integration, rate limits, and whether tools are available. DuckDuckGo says Duck.ai anonymises chats and offers several models, but that does not mean its GPT-5-mini behaves identically to the same model in OpenAI’s own environment.

A practical rule:

Use GPT-5-mini on DuckDuckGo for quick, private, low-stakes work: “explain this term”, “rewrite this”, “give me ideas”, “summarise this short passage”, “what are the likely causes of this Windows issue?”

Use GPT-5.5 thinking mode for anything where the cost of being wrong matters: medical interpretation, legal or financial reasoning, political/current-event analysis, philosophical argument construction, coding, troubleshooting, research quality evaluation, investment comparisons, or nuanced ethical judgement.

The mini model is more likely to hallucinate, miss qualifications, over-simplify, or give a plausible but shallow answer. GPT-5.5 thinking mode is more likely to notice ambiguity, correct itself, ask whether assumptions are justified, and give a more robust answer. It can still be wrong, especially on current facts, obscure details, or anything requiring direct source checking, but its error rate should be lower on difficult tasks.

So the simple comparison is: DuckDuckGo GPT-5-mini is “fast, private, convenient, good enough”; GPT-5.5 thinking mode is “slower, deeper, more reliable, better for serious reasoning.”"

LoveRainbow@lemmy.world · 4 hours ago

Interesting, thanks.

It seems the latest and most sophisticated model produces the correct answer at this point.

I am using GPT5.5 on “Thinking” mode.

LoveRainbow@lemmy.world · 4 hours ago

You can never chat with the same GPT twice…for the GPT has changed…and so have you.

LoveRainbow@lemmy.world · 4 hours ago

I don’t doubt you on this point. However, it is so far outside my ken that I wouldn’t be able to meaningfully evaluate its answers.

LoveRainbow@lemmy.world · 4 hours ago

Yeah…I mean…I’m not claiming that ChatGPT is an omniscient god who knows what all 8 billion humans are currently wearing.

If that’s the basis for “ChatGPT is shit and constantly produces wrong answers” then that’s that.

However, I have asked it your question (word for word) and it gives a truthful and fair answer of sorts:

“I can’t see you or your camera, so I don’t know what colour your T-shirt is.”

I have noticed this common flaw in judging the capacity of ChatGPT though:

“Even though the system can do truly amazing things in a hundred thousand conversations at once, can solve mathematical problems better than 99% of humans, “knows” more than any human alive, can synthesise new academic theories and apply complex ideas in novel ways, and can generate images of a quality beyond what 99% of humans (the non artist/graphic design humans) can produce in mere seconds: it can’t do [insert obscure or trivial skill] and therefore it is inferior to human intelligence.”

Let’s be clear: if a human could do what ChatGPT can do, we would think them the most intelligent human being alive by a significant margin.

LoveRainbow@lemmy.world · 4 hours ago

In practical terms, in the hypothetical pub that none of us can afford to drink in anymore, can ChatGPT be relied on to resolve most disagreements about factual claims?

I think so.

I think it is already accurate and reliable enough that it has a high degree of authority in resolving disagreements between people - assuming the questions are asked fairly and properly (a common issue, I’m sure).

LoveRainbow@lemmy.world · 4 hours ago

So, was this using DuckDuckGo’s integrated AI system or something?

I have no idea about that, I use ChatGPT - which I assume produces better answers than those casual search-engine integrated AIs.

LoveRainbow@lemmy.world · 4 hours ago

It might be because I have “plus” (or whatever) and get it to use the “thinking” model - one would hope this produces more reliable, reasoned, and accurate results.

This precise issue is the reason I’m not on Reddit: I got banned from Reddit for describing Israel as genocidal.

So I, also, am pleased to see that (for me at least) it is giving such a nuanced and fair answer.

So far the only clear example of political bias I’ve found is in history questions relating to colonialism in India. When I asked why the answers were so one-sided and biased, it said that it was using a “corrective approach” and was reflecting a bias in the academic literature from which its answers were derived.

I had to prompt it to give an answer with the opposite bias (which it did) or a more balanced description.

Certainly not perfect. Still: not as unreliable as its opponents like to make out (somewhat ironically, they misrepresent the truth of the situation, for political reasons, by accusing GPT of inaccuracy).

LoveRainbow@lemmy.world · edit-2 12 hours ago

What question can I ask ChatGPT, right now, that will reliably produce a factually incorrect, wrong, or false answer?
