[TUHS] On the unreliability of LLM-based search results

Alexis flexibeast at gmail.com
Tue May 27 12:21:07 AEST 2025


Jason Bowen via TUHS <tuhs at tuhs.org> writes:

> May 26, 2025 11:57:18 Henry Bent <henry.r.bent at gmail.com>:
>
>> It's like Wikipedia.
>
> No, Wikipedia has (at least historically) human editors who
> supposedly have some knowledge of reality and history.
>
> An LLM response is going to be a series of tokens predicted based
> on probabilities from its training data. The output may correspond
> to a ground truth in the real world, but only because it was
> trained on data which contained that ground truth.
>
> Assuming the sources it cites are real works, it seems fine as a
> search engine, but the text that it outputs should absolutely not
> be thought of as something arrived at by similar means as text
> produced by supposedly knowledgeable and well-intentioned humans.
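
That next-token mechanism is worth spelling out, because nothing in
it consults a source of truth. Here's a toy sketch in Python; the
"model" argument is a stand-in for whatever network scores candidate
next tokens, not any particular vendor's code. Generation is just
repeated sampling from a probability distribution over tokens:

  import math
  import random

  def sample_next(logits, temperature=1.0):
      """Sample one token id from unnormalised next-token scores."""
      # Temperature > 0 flattens (high) or sharpens (low) the distribution.
      scaled = [score / temperature for score in logits]
      peak = max(scaled)
      weights = [math.exp(s - peak) for s in scaled]   # softmax weights
      return random.choices(range(len(logits)), weights=weights, k=1)[0]

  def generate(model, prompt_ids, max_new_tokens=50, temperature=1.0):
      """Autoregressive generation: append one sampled token at a time."""
      ids = list(prompt_ids)
      for _ in range(max_new_tokens):
          logits = model(ids)   # scores for every token in the vocabulary
          ids.append(sample_next(logits, temperature))
      return ids

The loop optimises for plausibility given the training data, nothing
more; whether the resulting text is true never enters into it.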

LLMs are known to hallucinate sources. Here's a database of "legal
decisions in cases where generative AI produced hallucinated content":

  https://www.damiencharlotin.com/hallucinations/

Here's a research paper about LLMs hallucinating software packages:

  https://arxiv.org/abs/2406.10279

Not to mention LLMs hallucinating 'facts' about people:

  https://www.abc.net.au/news/2025-03-21/norwegian-man-files-complaint-chatgpt-false-claims-killed-sons/105080604
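
Coming back to the hallucinated packages for a moment: made-up
package names are at least cheap to check mechanically before they do
any damage. A minimal sketch, using only the Python standard library
and PyPI's public JSON endpoint (which returns 404 for names that
don't exist); the names passed on the command line are whatever an
LLM suggested:

  import sys
  import urllib.error
  import urllib.request

  def exists_on_pypi(name):
      """Return True if PyPI knows about a package with this name."""
      url = f"https://pypi.org/pypi/{name}/json"
      try:
          with urllib.request.urlopen(url, timeout=10) as resp:
              return resp.status == 200
      except urllib.error.HTTPError:
          return False          # 404: no such package

  if __name__ == "__main__":
      for name in sys.argv[1:]:
          status = "exists" if exists_on_pypi(name) else "NOT FOUND on PyPI"
          print(f"{name}: {status}")

Existence alone proves very little, of course; squatters have taken
to registering plausible-sounding hallucinated names, so a package
that does exist still needs the usual scrutiny. But it catches the
simplest failure mode.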

As a result of what they're trained on, chatbots can be
"confidently wrong":

  > [T]he chatbots often failed to retrieve the correct
  > articles. Collectively, they provided incorrect answers to more
  > than 60 percent of queries. Across different platforms, the
  > level of inaccuracy varied, with Perplexity answering 37 percent
  > of the queries incorrectly, while Grok 3 had a much higher error
  > rate, answering 94 percent of the queries incorrectly.

-- 
   https://www.cjr.org/tow_center/we-compared-eight-ai-search-engines-theyre-all-bad-at-citing-news.php

and can contradict themselves:

  https://catless.ncl.ac.uk/Risks/34/04/#subj2.1

Coming back to LLMs in the context of software:

  > AI coding tools ‘fix’ bugs by adding bugs
  >
  > ...
  >
  > What happens when you give an LLM buggy code and tell it to fix
  > it? It puts in bugs! It might put back the same bug!
  >
  > Worse yet, 44% of the bugs the LLMs make are previously known
  > bugs. That number’s 82% for GPT-4o.
  >
  > ...
  >
  > I know good coders who find LLM-based autocomplete quite okay —
  > because they know what they’re doing. If you don’t know what
  > you’re doing, you’ll just do it worse. But faster.

-- 
   https://pivot-to-ai.com/2025/03/19/ai-coding-tools-fix-bugs-by-adding-bugs/

A research paper found that:

  > participants who had access to an AI assistant based on OpenAI's
  > codex-davinci-002 model wrote significantly less secure code than
  > those without access. Additionally, participants with access to
  > an AI assistant were more likely to believe they wrote secure
  > code than those without access to the AI assistant. Furthermore,
  > we find that participants who trusted the AI less and engaged
  > more with the language and format of their prompts (e.g.
  > re-phrasing, adjusting temperature) provided code with fewer
  > security vulnerabilities.

-- https://arxiv.org/abs/2211.03622
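
The paper doesn't hinge on any one bug class, but the flavour of
vulnerability at issue is familiar. As a generic illustration (mine,
not code from the study), an assistant may well produce something
like the first function below; the second is the parameterised form
a careful reviewer would insist on:

  import sqlite3

  def find_user_unsafe(conn, username):
      # Vulnerable: a username like  x' OR '1'='1  rewrites the query.
      return conn.execute(
          f"SELECT id, username FROM users WHERE username = '{username}'"
      ).fetchall()

  def find_user_safe(conn, username):
      # Parameterised: the driver keeps the value out of the SQL text.
      return conn.execute(
          "SELECT id, username FROM users WHERE username = ?", (username,)
      ).fetchall()

  if __name__ == "__main__":
      conn = sqlite3.connect(":memory:")
      conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT)")
      conn.execute("INSERT INTO users (username) VALUES ('alice')")
      print(find_user_safe(conn, "alice"))
      print(find_user_unsafe(conn, "x' OR '1'='1"))   # returns every row

Both functions "work" on well-behaved input, which is exactly why
someone who trusts the assistant, and believes their code is secure,
is unlikely to notice the difference.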

As someone who has spent a lot of time and energy working on wikis,
I would say that a big difference between LLMs and wikis is that one
can directly fix misinformation on a wiki in a way that one can't
with an LLM. And wikis typically provide public access to the trail
of what changes were made to the information, when, and by whom (or
at least from what IP address), unlike the information provided by
LLMs.

The LLM cat is well and truly out of the bag. But the combination of
LLMs hallucinating information and the human tendency to treat the
confident conveyance of information as evidence of its veracity
means that people should be encouraged to take LLM output with a
cellar of salt, and to check that output against other sources.


Alexis.

