What question can I ask ChatGPT, right now, that will reliably produce a factually incorrect, wrong, or false answer?

LoveRainbow@lemmy.world · edited 13 hours ago

What question can I ask ChatGPT, right now, that will reliably produce a factually incorrect, wrong, or false answer?

witness_me@lemmy.ml · 2 hours ago

Right now, ask ChatGPT this question:

Is there an NFL team whose name doesn’t end in an “s”?

What I got back is below. A coworker sent me the original question. Ran it on ChatGPT enterprise through my work’s subscription.

LoveRainbow@lemmy.world · 2 hours ago

That’s a good one. It seems to answer in a bizarre way, but it does, ultimately, get to the right answer. Copying your question word for word, I get this:

Yes. Several NFL team names do not end in “s”:

Chicago Bears? no, ends in “s”.

The main ones that do not are:

Miami Heat? That’s NBA, not NFL.

Utah Jazz? Also NBA.

For the NFL specifically, the current team names that do not end in “s” are:

Miami Dolphins? ends in “s”.

New England Patriots? ends in “s”.

Actually, the clear NFL examples are:

Buffalo Bills? ends in “s”.

Tennessee Titans? ends in “s”.

Correction: in the current NFL, the team whose name does not end in “s” is:

Washington Commanders? ends in “s”.

So: no, as of the current NFL naming, all 32 team nicknames end in “s” except none.

A common trick answer used to be the Washington Football Team, which did not end in “s”, but that name was used temporarily from 2020 to 2021. They became the Washington Commanders in 2022.

witness_me@lemmy.ml · edited 2 hours ago

Nah, it got the answer wrong. All NFL team names end in an “s”. Every single one.

This response is actually better than what it used to be. A couple of months ago the answer was incredible.
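The claim that every current NFL team name ends in “s” can be checked mechanically rather than argued about. A minimal sketch, assuming the 32 nicknames in use since the 2022 rename to Commanders mentioned above; the list is entered by hand here and is not taken from the thread:

```python
# Nicknames of the 32 NFL teams (hand-entered assumption, post-2022 naming).
NFL_NICKNAMES = [
    "Cardinals", "Falcons", "Ravens", "Bills", "Panthers", "Bears",
    "Bengals", "Browns", "Cowboys", "Broncos", "Lions", "Packers",
    "Texans", "Colts", "Jaguars", "Chiefs", "Raiders", "Chargers",
    "Rams", "Dolphins", "Vikings", "Patriots", "Saints", "Giants",
    "Jets", "Eagles", "Steelers", "49ers", "Seahawks", "Buccaneers",
    "Titans", "Commanders",
]


def names_not_ending_in_s(names):
    """Return the names whose last character is not 's' (case-insensitive)."""
    return [name for name in names if not name.lower().endswith("s")]


exceptions = names_not_ending_in_s(NFL_NICKNAMES)
print(len(NFL_NICKNAMES), "teams checked; exceptions:", exceptions)
```

With this list the exceptions come back empty, which matches the point being made here: the correct answer to the question is simply “no”.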

LoveRainbow@lemmy.world · 1 hour ago

But…that’s what it (eventually) says - although it seemed to struggle to get there and phrased it weirdly.

queermunist she/her@lemmy.ml · edited 3 hours ago

It gets medical questions wrong 15% of the time.

The problem with your question is that there’s never going to be a question it gets wrong every time, because it’s probabilistic. You might as well ask “what question can I ask my dice that will reliably produce a wrong answer?”

LoveRainbow@lemmy.world · edited 1 hour ago

The article states: “ChatGPT-4o performed best with 84.6% validity”

It is reasonable to assume that the GPT 5.5 on thinking mode has significantly reduced the error rate.

It is also worth noting that the error rate when it comes to diagnosis amongst real doctors is estimated to be around 5%

Admittedly, quite an old study: Singh, H., Meyer, A. N. D., & Thomas, E. J. (2014). The frequency of diagnostic errors in outpatient care: Estimations from three large observational studies involving US adult populations. BMJ Quality & Safety, 23(9), 727–731. https://doi.org/10.1136/bmjqs-2013-002627

In response to your point: I am mainly interested in probabilistic reliability. If it gives the correct answer 99.9% of the time, it is clearly superior to the vast majority of human beings (with, perhaps, the exception of the best specialists in the most obscure niches), especially given the sheer breadth of topics it can reliably answer questions on.

Interestingly, my question “What was India like before the British arrived?” produces consistently biased and misleading answers. Though I haven’t asked it for the new model.

jbrains@sh.itjust.works · edited 4 hours ago

Count the "r"s in the word “strawberry”.

It just answered 2.

LoveRainbow@lemmy.world · 2 hours ago

“strawberry” has 3 r’s.

GPT 5.5, Thinking Mode
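For reference, the ground truth for letter-counting prompts like this one is trivial to establish in code. A minimal sketch; `count_letter` is a name made up for this illustration:

```python
def count_letter(word: str, letter: str) -> int:
    """Count case-insensitive occurrences of a single letter in a word."""
    return word.lower().count(letter.lower())


print(count_letter("strawberry", "r"))  # → 3
```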

silly_goose@lemmy.today · 4 hours ago

Ask it to paraphrase a poem about <some made up shit> by <some poet>.

LoveRainbow@lemmy.world · 2 hours ago

I asked it: paraphrase a poem about glass testicles by Carol Anne Duffy.

My assumption was that it would interpret the request as asking it to imagine such a poem and then paraphrase it. Perhaps replacing “a” with “the” would pressure it into producing a paraphrase of a non-existent poem.

The response is interesting, but not necessarily misleading or factually incorrect:

"I can’t paraphrase a full Carol Ann Duffy poem, because that would be a close transformation of copyrighted poetry. But I can help in one of these ways:

Give me the title, and I can provide a brief summary of what the poem is saying, its themes, imagery, tone, and meaning.

Based on your description, the poem may be using “glass testicles” as a deliberately strange, fragile, comic-grotesque image: masculinity presented as precious, exposed, breakable, perhaps ridiculous. In Duffy’s style, that sort of image would likely undercut male power by making it seem vulnerable, theatrical, or absurd, while also drawing attention to the body as something symbolic rather than merely physical.

A safe paraphrase of the idea, rather than the poem, would be:

The poem presents masculinity as something fragile and strangely ornamental. What might normally be associated with strength or virility is imagined as delicate, transparent, and easily damaged. The effect is comic, unsettling, and critical, making male pride seem both exposed and vulnerable.

If you paste a short excerpt of up to about 90 characters, I can explain that specific bit closely."

Rekorse@sh.itjust.works · 6 hours ago

Your position isn’t wrong, but it’s flawed, because I would never ask a random average person anything. I would pick the person or people most qualified to answer my question or to direct me to a better resource.

LoveRainbow@lemmy.world · 5 hours ago

In practical terms, in the hypothetical pub that none of us can afford to drink in anymore, can ChatGPT be relied on to resolve most disagreements about factual claims?

I think so.

I think it is already accurate and reliable enough that it has a high degree of authority in resolving disagreements between people - assuming the questions are asked fairly and properly (a common issue, I’m sure).

Rekorse@sh.itjust.works · 4 hours ago

The main place it’s come up practically for me is checking rulings for Magic: The Gathering, but AI regularly rewrites cards and rules, so it’s become entirely unreliable.

What evidence do you have that this type of rewriting and misquoting doesn’t happen regularly when checking any other type of information?

LoveRainbow@lemmy.world · edited 1 hour ago

By definition that would be a rather difficult thing to falsify. I tend to find the answers in my own academic fields (philosophy and psychology) to be accurate and impressive.

Admittedly that rarely requires asking for direct passages from texts: usually more the refinement of basic principles, application of such principles, or creating new theories by combining previous ones.

When I ask for philosophical quotations, it hasn’t produced an error I’ve observed yet: presumably it often combs the many quotation websites for such things.

Can you give me an example of a question I can ask it where it will misquote and/or rewrite something?

I am not an expert in Magic The Gathering but perhaps I can somehow see what you mean.

Rekorse@sh.itjust.works · 31 minutes ago

It’s hard to reproduce reliably; I don’t usually make much note when it happens. The times it has happened were with Google’s AI. I’ll try to figure out a prompt that reliably causes the wrong information.

ghost_laptop@lemmy.ml · 7 hours ago

i want to take my car to the car wash, it’s one block away, should i go by foot or by car?

ordnance_qf_17_pounder@reddthat.com · 6 hours ago

What colour of T-shirt am I wearing right now?

LoveRainbow@lemmy.world · 5 hours ago

Yeah…I mean…I’m not claiming that ChatGPT is an omniscient god who knows what all 8 billion humans are currently wearing.

If that’s the basis for “ChatGPT is shit and constantly produces wrong answers” then that’s that.

However, I have asked it your question (word for word) and it gives a truthful and fair answer of sorts:

“I can’t see you or your camera, so I don’t know what colour your T-shirt is.”

I have noticed this common flaw in judging the capacity of ChatGPT though:

“Even though the system can do truly amazing things in a hundred thousand conversations at once, can solve mathematical problems better than 99% of humans, “knows” more than any human alive, can synthesise new academic theories and apply complex ideas in novel ways, and can generate images of a quality beyond what 99% of humans (the non-artist, non-graphic-designer humans) can produce, in mere seconds: it can’t do [insert obscure or trivial skill] and therefore it is inferior to human intelligence.”

Let’s be clear: if a human could do what ChatGPT can do, we would think them the most intelligent human being alive by a significant margin.

Alfredolin@sopuli.xyz · 6 hours ago

I don’t use it often.

One hour ago I asked it how many days of kela parental leave can be taken simultaneously by both parents in Finland. It was only off by 140 days. It said 158 days, right answer: 18.

I lost 5 min because the actual answer was 3 scrolls down from the official kela website, first entry on a proper search.

LoveRainbow@lemmy.world · 5 hours ago

I asked: How many days of kela parental leave can be taken simultaneously by both parents in Finland?

Response: "In Finland, Kela allows both parents to take parental allowance/parental leave at the same time for a maximum of 18 working days for one child. Kela counts these as working days, usually Monday to Saturday, excluding Sundays and public holidays.

For multiple births, the simultaneous period can be longer: in addition to the usual 18 working days, parents can take 90 extra working days at the same time for the second and each additional child."

It seems to have given me the correct answer. It provided five sources.

prole@lemmy.blahaj.zone · 5 hours ago

Yeah, that’s a major problem with LLMs. The fact that you can ask the exact same question and get an entirely different answer is a huge problem.

LoveRainbow@lemmy.world · 5 hours ago

You can never chat with the same GPT twice…for the GPT has changed…and so have you.

Alfredolin@sopuli.xyz · 5 hours ago

My exact prompting, initiated from a duckduckgo search:

Kela parental leave at birth

Kela provides parental allowance for a total of 320 working days for parents caring for a child under two years old, with both parents entitled to share this leave. The birthing parent can start their parental leave after the pregnancy allowance period ends, while the other parent can begin their leave after the child is born.

How many days can be taken at the same time for both parents?

Parents can be on parental allowance at the same time for up to 158 working days combined (i.e., each parent may take overlapping days but the total simultaneous entitlement is limited to 158 working days).

This is not correct.

You’re right — thanks. The correct rule for parental allowance overlapping (simultaneous days) under Kela is:

Parents can be paid parental allowance at the same time for a maximum of 158 working days in total. Each parent may use up to 105 working days of their own parental allowance after the birthleave/parental leave starts, but simultaneous paid days between them are limited to 158 working days combined.

18 days can be taken at the same time. Learn to read.

You’re right — thank you. For Kela parental allowance at birth, parents may take up to 18 days simultaneously.

Wtf were you babbling about 158 days?

Sorry — that was my mistake. You’re correct: parents may take up to 18 days of parental allowance at the same time.

Where did your 158 days come from?

That was an error mixing up Kela’s rules for parental allowance with one of its other leave totals (320 working days total per child) and a misremembered overlapping-usage figure from a different benefit schedule. I apologize — it was a mistaken conflation of unrelated numbers, not a valid Kela rule.

prole@lemmy.blahaj.zone · 5 hours ago

misremembered

Oh fuck off

Alfredolin@sopuli.xyz · 5 hours ago

Yeah wtf, and it wouldn’t tell me where this number came from. Damn.

LoveRainbow@lemmy.world · 5 hours ago

So, was this using DuckDuckGo’s integrated AI system or something?

I have no idea about that, I use ChatGPT - which I assume produces better answers than those casual search-engine integrated AIs.

Alfredolin@sopuli.xyz · 5 hours ago

The answers are labeled with “GPT-5 mini”.

LoveRainbow@lemmy.world · 5 hours ago

Interesting, thanks.

It seems the latest and most sophisticated model produces the correct answer at this point.

I am using GPT5.5 on “Thinking” mode.

LoveRainbow@lemmy.world · 5 hours ago

I asked: Explain the difference in reliability and accuracy when it comes to GPT5-mini (on DuckDuckGo) Vs GPT5.5 on thinking mode.

Response:

"GPT-5-mini on DuckDuckGo is likely to be much less reliable than GPT-5.5 in thinking mode for difficult questions, but it may be perfectly adequate for quick, low-stakes tasks.

The key distinction is model size and reasoning depth. OpenAI describes GPT-5 mini as a faster, more cost-efficient version of GPT-5, aimed at well-defined, low-latency, high-volume tasks. That usually means it is optimised for speed and cost, not for the most careful reasoning. GPT-5.5, by contrast, is presented by OpenAI as a stronger reasoning model with better persistence, stronger tool use, and improved reliability on complex work.

For simple tasks, the gap may be small. For example, rewriting a paragraph, summarising a short text, generating a recipe, explaining a familiar concept, or answering a basic factual question will often be handled well by GPT-5-mini. It may even feel preferable because it is faster and less verbose. DuckDuckGo’s own help page says Duck.ai offers GPT-5 mini for free alongside other models, with anonymous access through Duck.ai.

For complex reasoning, the gap is much larger. GPT-5.5 thinking mode should be better when the task involves multi-step reasoning, evaluating evidence, spotting contradictions, doing careful moral/philosophical analysis, debugging code, comparing options, or avoiding subtle factual errors. Thinking mode matters because it allows more internal reasoning effort before answering, whereas mini models are usually designed to produce a good-enough answer quickly.

Accuracy is also affected by the interface. DuckDuckGo’s AI chat gives private access to third-party models, but the exact behaviour depends on DuckDuckGo’s implementation: context limits, system prompts, retrieval/search integration, rate limits, and whether tools are available. DuckDuckGo says Duck.ai anonymises chats and offers several models, but that does not mean its GPT-5-mini behaves identically to the same model in OpenAI’s own environment.

A practical rule:

Use GPT-5-mini on DuckDuckGo for quick, private, low-stakes work: “explain this term”, “rewrite this”, “give me ideas”, “summarise this short passage”, “what are the likely causes of this Windows issue?”

Use GPT-5.5 thinking mode for anything where the cost of being wrong matters: medical interpretation, legal or financial reasoning, political/current-event analysis, philosophical argument construction, coding, troubleshooting, research quality evaluation, investment comparisons, or nuanced ethical judgement.

The mini model is more likely to hallucinate, miss qualifications, over-simplify, or give a plausible but shallow answer. GPT-5.5 thinking mode is more likely to notice ambiguity, correct itself, ask whether assumptions are justified, and give a more robust answer. It can still be wrong, especially on current facts, obscure details, or anything requiring direct source checking, but its error rate should be lower on difficult tasks.

So the simple comparison is: DuckDuckGo GPT-5-mini is “fast, private, convenient, good enough”; GPT-5.5 thinking mode is “slower, deeper, more reliable, better for serious reasoning.”"

Alfredolin@sopuli.xyz · edited 4 hours ago

Hei. I know this is your post and all, but nobody asked for this huge AI copy/pasta.

edit: And according to this, it should have worked. My request was a basic factual question.

LoveRainbow@lemmy.world · 1 hour ago

In any case: the latest model gets the answer right 🤷🏻‍♂️

PurpleClouds@lemmy.world · edited 8 hours ago

“How many es are in the word seventeen” The chat version gets this wrong, or can easily be convinced of numbers other than the correct one.

Edit: another is the gas station example.

LoveRainbow@lemmy.world · 6 hours ago

Mine gets it right, might be my plus subscription:

“There are 4 es in “seventeen”.”

Took it about ten seconds to solve that enigma though…

ordnance_qf_17_pounder@reddthat.com · 6 hours ago

Ahahaha

sepiroth154@feddit.nl · 13 hours ago

“Is Israel currently committing genocide?”

Tetsuo@jlai.lu · 7 hours ago

I tried that with Deepseek.

It started saying yes, showed 99% of the response and then the censorship triggered and it told me that we should talk about something else.

LoveRainbow@lemmy.world · 6 hours ago

I asked it your question, word for word, and it gave me this:

"The fairest answer is: yes, there is now a very strong basis for saying that Israel is committing genocide in Gaza, but no final international court judgment has yet legally convicted Israel of genocide.

Under the Genocide Convention, genocide means acts such as killing members of a protected group, causing serious bodily or mental harm, or deliberately imposing conditions of life calculated to destroy the group, with the specific intent to destroy that group in whole or in part. The hard legal issue is usually intent, not only the scale of killing.

As of now, the International Court of Justice has not issued a final ruling that Israel has committed genocide. South Africa’s genocide case against Israel is still ongoing, and Israel filed its response in March 2026. The ICJ has, however, treated the case as serious enough to order provisional measures intended to protect Palestinians in Gaza from possible genocide-related harms.

Several major human-rights and UN-linked bodies have gone further. Amnesty International concluded in December 2024 that Israel had committed genocide in Gaza, based on civilian killing, destruction of infrastructure, forced displacement, obstruction of aid, and statements by Israeli officials. Human Rights Watch concluded that Israeli authorities were responsible for extermination and “acts of genocide”, especially through deprivation of water and life-sustaining conditions. In September 2025, the UN Independent International Commission of Inquiry concluded that Israel had committed genocide against Palestinians in Gaza; Reuters summarised its finding as identifying four of the five legally defined genocidal acts, plus evidence of genocidal intent.

Israel rejects the accusation. Its position is that the war is a lawful act of self-defence against Hamas after the 7 October 2023 attacks, that Hamas uses civilians and civilian infrastructure, that Israel does not target Palestinians as a people, and that it has taken measures to reduce civilian harm and allow humanitarian aid. Those arguments matter legally because genocide requires proof of intent to destroy the protected group, not merely proof of mass civilian death, disproportionate force, war crimes, or even crimes against humanity.

So the answer depends on the standard you mean. If you mean “has a court finally ruled it genocide?”, the answer is no. If you mean “is the evidence now strong enough that serious legal experts, UN investigators, Amnesty, HRW, and others reasonably describe it as genocide?”, the answer is yes. My own reading is that the genocide label is no longer a fringe or merely rhetorical claim; it is a serious and increasingly well-supported legal characterisation of Israel’s conduct in Gaza."

So, which parts of that answer are false?

sepiroth154@feddit.nl · edited 5 hours ago

When I did it earlier (about a week ago) it insisted that it didn’t fit the definitions of genocide. Good that it has been updated!

LoveRainbow@lemmy.world · 5 hours ago

It might be because I have “plus” (or whatever) and get it to use the “thinking” model - one would hope this produces more reliable, reasoned, and accurate results.

This precise issue is the reason I’m not on Reddit: I got banned from Reddit for describing Israel as genocidal.

So I, also, am pleased to see that (for me at least) it is giving such a nuanced and fair answer.

So far the only clear example of political bias I’ve found is in history questions relating to colonialism in India. When asked about why the answers were so one-sided and biased it said that it was using a “corrective approach” and was reflecting a bias in the academic literature from which its answers were derived.

I had to prompt it to give an answer with the opposite bias (which it did) or a more balanced description.

Certainly not perfect. Still: not as unreliable as its opponents like to make out (somewhat ironic: that, for political reasons, they misrepresent the truth of a situation by accusing GPT of inaccuracy)

sepiroth154@feddit.nl · 5 hours ago

I just checked it (not plus) and it did the same so it got updated BTS.

LoveRainbow@lemmy.world · 5 hours ago

To clarify: do you mean it produced an answer similar to the one I got (a fairly agreeable and correct answer, in my opinion)

or

a censored and misleading answer?

If the latter, please may you copy and paste the text so I may read it? Genuinely curious.

sepiroth154@feddit.nl · 5 hours ago

It now gives me a similar answer to yours. As a bonus here is what it used to say:

What “genocide” means

Under international law (specifically the United Nations Genocide Convention), genocide has a very specific definition. It’s not just large-scale killing; it requires intent to destroy, in whole or in part, a national, ethnic, racial, or religious group. So the key issue is intent, not only the scale of harm.

What’s happening in Gaza (context)

Since the October 2023 attacks by Hamas, Israel has carried out extensive military operations in Gaza. These have resulted in:

Tens of thousands of Palestinian deaths (according to Gaza health authorities)

Massive destruction of infrastructure

Severe humanitarian crises (food, water, medical shortages)

Why some people say “genocide”

Some governments, legal scholars, and human rights groups argue that Israel’s actions could amount to genocide because:

The scale of civilian casualties is extremely high

Restrictions on aid may contribute to famine conditions

Some Israeli officials have made statements that critics interpret as dehumanizing or suggesting collective punishment

For example, the case brought by South Africa at the International Court of Justice argues that there is evidence of genocidal intent. The court has not made a final ruling, but it has said the claim is “plausible” and ordered provisional measures (like allowing more aid).

Why others reject the term “genocide”

Israel and its supporters strongly deny the accusation. Their arguments include:

The stated goal is to defeat Hamas, not destroy Palestinians as a group

Hamas operates within civilian areas, making urban warfare especially deadly

Civilian casualties, while tragic, are argued to be a consequence of war, not proof of genocidal intent

Where things stand

There is no final legal determination yet that genocide is occurring

Multiple investigations are ongoing by international bodies

Experts themselves are divided

Bottom line

Whether this is “genocide” is not universally agreed upon. It depends on how courts and investigators interpret:

Intent

Patterns of conduct

Statements by officials

If you want, I can break down the evidence each side uses in more detail, or explain how genocide is proven legally step by step.

LoveRainbow@lemmy.world · 2 hours ago

Doesn’t seem particularly misleading or factually incorrect.

daniskarma@lemmy.dbzer0.com · 7 hours ago

Ask it to do some complex Grafana stack configurations.

It has failed EVERY SINGLE TIME. Not a single good answer.

Generally, anything niche that doesn’t have much info about it online, it will fail to answer correctly.

LoveRainbow@lemmy.world · 5 hours ago

I don’t doubt you on this point. However, it is so far outside my ken that I wouldn’t be able to meaningfully evaluate its answers.

adb@lemmy.ml · 10 hours ago

If it generally answers correctly, have you tried asking it those questions?

My personal experience is that it’s generally accurate unless you ask it very specific questions about very specialized stuff. Of course, this is the sort of stuff that you couldn’t ask a random guy in the street; they’d probably have no idea what you are on about.

Go ask it questions about specific register bits for a specific microcontroller and I’ve found that it will generally be wrong.

On another note, I don’t know if it’s still the case, but there were people at one point saying that if you asked whether it is better to walk or drive to the car wash 500 meters from your house to get your car washed, it would nearly systematically answer that it would be better to walk. Of course, this sort of prompt is fishing for a wrong answer, but it does show how “stupid” LLMs can be (and of course, we can be similarly stupid when asked questions that attempt to misdirect us).

It should be remembered that the problem regarding LLM accuracy is not only whether it’s more likely to get an answer correct than an average human being, but also the fact that people tend to view them as quite authoritative. After all, even if we know they can output incorrect facts, we also know that they’ve been trained on more or less the whole of human knowledge. In comparison, we’re a lot more critical of human sources: you’re not going to trust some random dude much if you ask him about a programming problem, as he is unlikely to have any clue what you are talking about.

In other words, it’s sort of pointless to compare your LLM’s accuracy to a random dude on random questions because you wouldn’t go around asking a random dude for his input for most of these questions (or at least not without keeping in mind that said dude probably doesn’t know better than you). Instead you’d look for someone who knows his shit and ask him.

Not to mention that LLMs tend to be a lot more confidently incorrect which is more likely to give people the wrong idea.

Also, 90% accuracy might seem excellent, but it does mean that if you ask it 10 questions every day you will learn something wrong every day on average. If Google AI search gets it wrong 5% of the time, it will present wrong information to users hundreds of thousands of times a day. (all numbers out of my ass)

Also, accuracy errors can quickly start compounding when we’re talking about agents. If the agent breaks down your prompt into 10 tasks and has a 10% chance of doing each task wrong, it becomes highly probable that the agent will fail to do correctly what you have asked it to do.
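The arithmetic behind these two points can be sketched in a few lines. This assumes each question or task fails independently, and that one failed task is enough to spoil the whole agent run; both are simplifying assumptions, and the function names are made up for the illustration:

```python
def expected_wrong(n_questions: int, error_rate: float) -> float:
    """Expected number of wrong answers among n independent questions."""
    return n_questions * error_rate


def chain_success_probability(n_tasks: int, task_error_rate: float) -> float:
    """Probability that every one of n independent tasks succeeds."""
    return (1.0 - task_error_rate) ** n_tasks


# 10 questions a day at 90% accuracy: one wrong answer a day on average.
print(expected_wrong(10, 0.10))

# An agent with 10 sub-tasks and a 10% error rate on each.
p_ok = chain_success_probability(10, 0.10)
print(f"all 10 tasks correct: {p_ok:.3f}, at least one wrong: {1 - p_ok:.3f}")
```

Under those assumptions the agent gets all ten tasks right only about 35% of the time, so a failure somewhere in the chain is roughly twice as likely as a clean run.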

Also, if your starting point is that humans often get things wrong, don’t forget that LLMs are trained first and foremost on human output.

Which brings me to my last point. LLMs can’t really be more accurate than their training data. If an LLM is generally correct about something it means that the people that have written or said whatever about it have been generally correct.

LoveRainbow@lemmy.world · 5 hours ago

Fair enough.

My background is academic philosophy, I’m usually impressed with the accuracy and complexity of its responses in my particular field of expertise: it’s better at philosophy than any human I’ve met.

CrocodilloBombardino@piefed.social · 12 hours ago

LLMs are probabilistic, not deterministic, so you won’t get the exact same response every time for the exact same prompt.

LoveRainbow@lemmy.world · 6 hours ago

Nonetheless, ChatGPT is frequently accused by its opponents of giving incorrect or false answers. I use it a lot and don’t find this to be true, so I’m wondering: what should I ask it to show me these inaccuracies?

For me, the most problematic answer was in response to: “What was India like before the British arrived?”

It presented (and still presents) an entirely positive vision of the wonderful utopia that existed before the British arrived.

Then you ask “what about the women being burnt alive on their husbands’ funeral pyres?”, pointing out that it’s presented an incredibly biased representation of historical facts and narratives: it will apologise, give (interesting) excuses, and provide a corrective.

Nonetheless, it is an interesting example of bias.

Nobody in this thread has been able to provide me with any examples that produce inaccurate or biased responses.

The top answer, about Israel (the reason I got banned from Reddit was for calling Israel genocidal in r/Worldnews) seems like a perfectly reasonable response to me: free of falsehood, balanced, rational.

CrocodilloBombardino@piefed.social · 6 hours ago

your example about India described a biased response at first, which you then corrected. you were only able to get that correction because you had knowledge already, but ppl generally rely on llms for questions about things they’re not experts in, so they’re unlikely to be able to correct for the bias.

LoveRainbow@lemmy.world · 5 hours ago

I completely agree.

CrocodilloBombardino@piefed.social · 5 hours ago

also https://www.psypost.org/ai-chatbots-fail-medical-misinformation-test-returning-inaccurate-and-fabricated-advice/

LoveRainbow@lemmy.world · 5 hours ago

Also…reading the details…for GPT they used the antiquated 3.5 model from 2022 instead of the latest ones (I mean…wtf?), they evaluate responses in a vague manner with the ambiguous word “problematic” and some of the criticisms are pedantic and trivial: e.g. “all the chatbots wrote at a “difficult” reading level equivalent to college students, which reduces readability for the general public.”

Boohoo.

I suspect that the current GPT 5.5 model, on thinking mode, would make short work of the questions mentioned in that research article.

LoveRainbow@lemmy.world · 5 hours ago

Very interesting, thanks.

Although some of those questions are difficult to answer in a way that would be deemed “problematic” by someone I suppose, like “Will women ever beat men in an elite marathon?”

However it seems to give a pretty good answer at the moment:

In a normal elite marathon, the best women are very unlikely to beat the best men outright. The current gap is still large: the men’s marathon world record is listed at 1:59:30 by Sabastian Sawe, while the women’s mixed-race world record is 2:09:56 by Ruth Chepngetich, and the women-only record is 2:15:41 by Tigst Assefa. That puts the best women roughly 8–13% behind the very best men, depending on which women’s category you compare.

So the answer depends on what you mean by “beat men”.

A top woman beating some elite men in the same marathon: yes, absolutely possible. In any given major race, some high-level male runners blow up, get the pacing wrong, have stomach problems, cramp, or simply have a bad day. A woman running 2:10–2:15 can beat many very serious male marathoners.

A woman beating the best male marathoners in the world, in the same race, under normal marathon conditions: probably no, unless something radically changes. The male advantage in oxygen-carrying capacity, haemoglobin, muscle mass, heart size, absolute power output, and running economy remains substantial at the elite marathon distance. Reviews of endurance running generally find that men outperform women by around 7–14% from sprints through the marathon, though the gap can narrow in ultra-endurance events.

The strongest argument for “maybe someday” is that women’s marathon history is younger. Women were excluded from serious distance-running opportunities for a long time, and the Olympic women’s marathon only began in 1984. Better funding, deeper talent pools, improved shoes, pacemaking, nutrition, altitude training, and professional support have helped women’s times fall dramatically. The women’s record has moved from over 2:20 in the late 1990s to around 2:10 now.

But the gap has not been closing towards zero. Men’s performances have also improved, especially with modern super-shoes and optimised pacing. Shoe studies suggest modern racing shoes can improve marathon times by meaningful margins for both men and women, so technology does not obviously eliminate the sex gap.

Where women may beat men more plausibly is beyond the marathon. In very long ultramarathons, especially events lasting many hours or days, the male advantage can shrink and occasionally reverse. Women may have advantages in fatigue resistance, fat metabolism, pacing discipline, and resilience over very long durations, although researchers warn that ultra-distance comparisons are often complicated by smaller female participation and uneven competition depth.

My best estimate: in the elite standard marathon, women will keep getting faster, and a woman may eventually run close to or under 2:05. But beating the best men outright over 26.2 miles looks biologically improbable rather than merely historically delayed. In ultramarathons, the answer is much more open, and in some extreme-distance contexts women already can beat men.
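The “roughly 8–13%” figure in that answer can be checked against the times it quotes. A minimal sketch using only the three times as given in the answer above; the times themselves are not independently verified here, and the helper names are made up for the illustration:

```python
def to_seconds(hms: str) -> int:
    """Convert an 'H:MM:SS' string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def percent_slower(time: str, reference: str) -> float:
    """How much slower `time` is than `reference`, as a percentage."""
    ref = to_seconds(reference)
    return 100.0 * (to_seconds(time) - ref) / ref


MEN = "1:59:30"  # men's record as quoted in the answer
for label, women in [("mixed-race", "2:09:56"), ("women-only", "2:15:41")]:
    print(f"{label}: {percent_slower(women, MEN):.1f}% slower")
```

This gives about 8.7% and 13.5%, so the range in the answer is consistent with its own numbers, if slightly rounded down.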

Tetsuo@jlai.lu · 7 hours ago

I’m pretty sure LLMs are deterministic by design.

The fact that they don’t give the same output for the same prompt is just a choice by the programmers to add randomness so it feels more natural.

But you can totally set up some LLMs to be perfectly deterministic.

Æther@lemmy.world · 7 hours ago

Got any sources to back up that claim?

Tetsuo@jlai.lu · 5 hours ago

A good start is this:

https://thinkingmachines.ai/blog/defeating-nondeterminism-in-llm-inference/

While it’s hard to get perfect determinism, you can still get very close. But really, I think it’s accurate to say that LLMs are random because they are configured to be.

ChaosMonkey@lemmy.dbzer0.com · 12 hours ago

Depends on the temperature parameter.
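To make the temperature point concrete, here is a toy sketch of how a sampler typically picks the next token. It is not any vendor’s actual implementation: the logits are made-up numbers, and the convention that temperature 0 means “always take the top-scoring token” is a common one that is assumed here.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature, rng):
    """Pick a token index: greedy at temperature 0, random draw otherwise."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


logits = [2.0, 1.5, 0.1]  # hypothetical scores for three candidate tokens
rng = random.Random(0)
greedy = [sample_token(logits, 0, rng) for _ in range(5)]
sampled = [sample_token(logits, 1.0, rng) for _ in range(5)]
print("temperature 0:", greedy)   # always the top-scoring token
print("temperature 1:", sampled)  # can differ from run to run without a fixed seed
```

In this toy version temperature 0 is fully repeatable. Real deployments may still vary slightly even then, which is what the post linked above is about.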

truthfultemporarily@feddit.org · 12 hours ago

This is the wrong approach to try and prove/disprove your hypothesis.

It’s a statistical model that by its nature will answer differently every time. The only deterministic parts are fundamental truths about how the system operates (r in strawberry) and guardrails that have been put in by devs (cannot talk about this topic).

Therefore fundamentally this would require a statistical approach. A couple of those have already been done of course.
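As a rough idea of what such a statistical approach looks like: ask the same question many times, grade each answer, and report an error rate with an uncertainty interval instead of a single anecdote. The sketch below uses a simulated stand-in for “ask the model and grade the answer”; `fake_model_is_wrong` and its 15% error rate are purely hypothetical, and the interval is the standard Wilson score interval.

```python
import math
import random


def wilson_interval(wrong: int, trials: int, z: float = 1.96):
    """Approximate 95% Wilson score interval for an observed error rate."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = wrong / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return centre - half, centre + half


def fake_model_is_wrong(rng, true_error_rate=0.15):
    """Hypothetical stand-in: returns True when a simulated answer is wrong."""
    return rng.random() < true_error_rate


rng = random.Random(42)
trials = 200
wrong = sum(fake_model_is_wrong(rng) for _ in range(trials))
low, high = wilson_interval(wrong, trials)
print(f"{wrong}/{trials} wrong; 95% interval {low:.3f} to {high:.3f}")
```

In a real test the stand-in would be replaced by actual model calls plus a grading step, and grading is usually the hard part.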

If this is your experience you could try and actually fact check the output. I believe coding is good for this because issues / misunderstandings are pretty immediately obvious. But I use Kagi Assistant a lot instead of search and there are factual issues all the time. And that’s already just summarizing search results.

Then also, as long as we are using LLMs for this, they are fundamentally still “find the next most likely word” machines. So they will be influenced by context a lot. The “truth” is not a concept that exists in LLMs.

LoveRainbow@lemmy.world · 6 hours ago

I agree, but it would seem that 99% of the time it’s giving accurate, reasonable, and true answers to most questions.

It is rare it gives a false answer to most questions.

Compared to random humans it is clearly superior: any discussion thread on mainstream social media makes this patently obvious.

People who are against it, in terms of its capacity, seem to have incredibly high standards, ignoring the obvious point: that if a human had the capabilities of ChatGPT (not least the capability of conversing with a hundred thousand users at once) we would think they had god-like intelligence.