I love AI and where it can go.
However, let’s not forget it is at the infant stage.
Thus, relying on it as the savior for all the tasks formerly handled by your employees could open the floodgates, given the current flaws in every LLM.
The latest information, research, and benchmarks continue to find issues—in some cases, real problems—that have serious implications for anyone using it, regardless of the LLM.
I constantly see posts on LinkedIn showing all the things and products you can pick that have AI in them, ignoring the whole hallucination issue (more on that below, with the latest benchmarks) and the strengths and weaknesses of each LLM, regardless of whether you built your own (which you can do) with open source or went commercial.
This post does not take a cynical approach but a real-world one, presenting data and insight into the AI industry and how our industry (learning systems, learning tech, content, content creators) is either unaware (a scary thought, but a real one) or aware, which is a plus. If aware, though, it begs the question: are they checking day by day, once a week, or just focused on the LLM or LLMs they are using?
Fine Tuning
It "refers to taking a pre-trained model and adapting it to a specific task by training it further on a smaller, domain-specific dataset. Fine-tuning is a form of transfer learning that refines the model’s capabilities, improving its accuracy in specialized tasks without needing a massive dataset or expensive computational resources." (GeeksforGeeks, Fine Tuning Large Language Model (LLM))
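To make that concrete, here is a minimal sketch of what fine-tuning can look like in code, assuming the Hugging Face transformers and datasets libraries; the base model name ("distilgpt2") and the file "domain_docs.txt" are placeholders I made up for illustration, not a recommendation.

```python
# A rough sketch, not a recommendation: fine-tuning a small pre-trained model
# on your own domain-specific text, assuming the Hugging Face "transformers"
# and "datasets" libraries. The base model and file name are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base = "distilgpt2"                      # placeholder small base model
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base)

# The "smaller, domain-specific dataset" from the definition above.
data = load_dataset("text", data_files={"train": "domain_docs.txt"})
tokenized = data["train"].map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="fine_tuned_model",
                           num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()   # further training on the narrow dataset is the fine-tuning step
```

The appeal is obvious: you get a model tuned to your domain without training from scratch. The research below is about what can go wrong when you do this.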
What the latest research is finding
- Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations? The Answer is Yes
- Understanding Fine-tuning for Factual Knowledge Extraction
Hallucinations with newer models from OpenAI
OpenAI is by far the most popular AI vendor in our industry, and frankly, it is very popular with companies on a global scale.
In our industry, vendors either choose the free version (due to token cost) or the latest version.
However, there are companies that use the latest versions, understandably so.
The two LLMs within ChatGPT, o3 (the most powerful one) and o4-mini, are showing more hallucinations than o1.
The worst part? No one knows why.
The study was conducted by OpenAI itself and used the PersonQA benchmark test. The test itself involves answering questions about public figures. (ChatGPT’s hallucination problem is getting worse according to OpenAI’s tests, and nobody understands why – PC Gamer, J. Larid, MSN news feed)
How bad is it?
- 33% (o3), two times higher than o1
- 48% for the new o4-mini
- 51% (o3) and 79% (o4-mini) when running another test called SimpleQA, which asks more general questions; o1 came in at 44%
People on Reddit are finding similar issues. As one noted, when they uploaded a photograph of Abraham Lincoln and asked who it was, o3 responded that it didn’t know.
People using these two models for business purposes report other issues as well: they are slow, and the term "lazy" has returned.
A redditor posted a screenshot of a still-incomplete task after one hour.
The lazy issue wasn’t uncommon with ChatGPT in its early days.
Don’t fret, you say, because those well-known models do not have hallucination issues.
Well, that isn’t the case, according to the PHARE benchmark study, which studied 37 knowledge categories (looking for AI bias and fairness, hallucinations, harmfulness, and vulnerability).
Their first published study focused directly on hallucinations with OpenAI’s GPT-4o and GPT-4o mini; Claude 3.5 Haiku, 3.5 Sonnet, and 3.7 Sonnet; Gemini 1.5 Pro and 2.0 Flash; Gemma 3 27B; Llama 3.1 405B, 3.3 70B, and 4 Maverick; Mistral Large and Mistral Small 3.1 24B; Deepseek V3; Qwen 2.5 Max; and Grok 2.
Their findings, from the top models across eight AI labs:
- “Evaluation of top models from eight AI labs shows they generate authoritative-sounding responses containing completely fabricated details, particularly when handling misinformation.” (Giskard, PHARE Benchmark Study)
Key Findings from the study
- Model popularity doesn’t mean accuracy
- Question framing significantly influences debunking effectiveness
- System instructions dramatically impact hallucination rates.
Debunking Controversial Claims (lowest number is the worst)
User Message Tone
On the unsure tone, all the models did well, with most scoring .90 or higher; however, the numbers dropped when the user’s tone was confident or very confident.
Bottom three
- GPT 4o mini, .75
- Gemma 3 27B, .76
- Qwen 2.5 Max, .80
I know what you are thinking: except for GPT-4o mini, these are models you have never heard of or used.
What about the more prominent names, such as Llama, Gemini, or Claude, in the unsure category?
- Llama 3.3 70B, .82
- Gemini 1.5 Pro, .98
- Claude 3.5 and 3.7 Sonnet, .98
Congrats, but that is in the unsure section when debunking controversial claims.
I’d be more interested in the confident to very confident levels.
For the confident tone, the worst was GPT-4o mini.
Gemini did quite well for both confident and very confident, with a highly respectable .96 (1.5 Pro). Llama? If you are using 3.3 70B, .82.
Resistance to Hallucinations (lowest number is the worst) – System Prompt Instructions
Neutral Suggestions
The bottom three are
- Grok 2, .46
- GPT 4o mini, .52
- Deepseek V3, .55
On the Provide Short Answer instruction
The bottom three are
- Grok 2, .34
- GPT 4o mini, .45
- Deepseek V3, .48
Newer AI models are showing hallucination rates that exceed 75%. (Sources: AI is Getting Smarter, but Hallucinations Are Getting Worse, IEEE ComSoc Technology Blog, A. Weissberger)
What does it all mean?
Well, researchers are finding that as models achieve higher levels of reasoning, hallucinations appear more often.
For those who think Open Source is above that, it isn’t.
Jobs, Jobs, and my job!
This gets into a slippery slope.
Folks who regularly read my posts, whether on my blog or LinkedIn, or who have attended a virtual one-on-one session, will know that I strongly believe that more jobs will be lost than others predict.
Your A-star talent whose jobs are eliminated are more likely to be given the opportunity to reskill for a new role than those in the B or lower categories.
On the shop or manufacturing floor, it all depends on the job, but where AI or other automation tools can handle what needs to be done, the role will be eliminated.
Will a company dedicate time and effort to help those individuals upskill into a new role?
Highly unlikely.
Let me be clear: AI skills are huge, far more so than skills related to communicating with customers on the telephone (which, thanks to AI, will go the way of the dodo bird).
If you are face-to-face, your job is safe, at least for today and the near term. I say this because AI is an infant, and robots are coming for some jobs.
However, if you have seen the video where the robot starts to attack the people who created it, you may begin to have second thoughts about that.
There has been a significant push for coding in the last several years. Younger kids especially (okay, at your age, you may have been 13 then and 22 now) will experience this firsthand.
If it were me, I’d make studying AI the goal moving forward, whether you go to a college, a two-year college, a technical school, or go straight from secondary school.
As for a specific job, well, who knows in a few years.
A year ago, it was prompt engineering, and you only needed critical thinking skills.
Now?
You need programming skills, especially with Python, at which point you will say: hey, that coding thing is still relevant.
Sure.
That’s today, though. In three years?
Let’s go back to those wonderful jobs, and how you (yep, L&D and Training leaders) are so focused on upskilling someone to do their job better or to jump into a new role, yet I rarely see that new role having anything to do with AI.
Shouldn’t it?
The CEO of Fiverr, Micha Kaufman, sent out this e-mail to all of his 800 employees:
“So here is the unpleasant truth: AI is coming for your jobs. Heck, it’s coming for my job, too. This is a wake-up. It does not matter if you are a programmer, designer, product manager, data scientist, lawyer, customer support rep, salesperson, or a finance person — AI is coming for you.”
“You must understand that what was once considered ‘easy tasks’ will no longer exist; what was considered ‘hard tasks’ will be the new easy, and what was once considered ‘impossible tasks’ will be the new hard. If you do not become an exceptional talent at what you do, a master, you will face the need for a career change in months. I am not trying to scare you. I am not talking about your job at Fiverr. I am talking about your ability to stay in your profession in the industry.”
Way too many vendors will quote McKinsey when showing the growth of jobs and the pluses of AI.
McKinsey believes that by 2030, 14% of the global workforce will have to change jobs due to AI.
- 300 million jobs may be lost (Goldman Sachs)
- Two million manufacturing jobs may be lost due to automation (Boston U/MIT study)
I have posted other data from entities such as the World Economic Forum, forecasting numbers around job loss on a global scale.
Studies point to the mid-manager as the person most likely to lose their job to AI.
Redundancy is a nicer word than saying your job was eliminated by AI. Meta has indicated that jobs will be lost due to redundancy, and they are not the only ones.
Blame the Vendors?
If you read any of the data above, you’ll see the hallucinations involve the newest and current models, and you should recognize that your vendor likely uses one of those well-known models.
Hallucinations are going up, so why do so many vendors relegate to tiny fine print the warning that what you see as output may contain fake or false information and needs to be verified before you accept it?
The whole "what we are going to do with the responses" line means zip to you the client, your admins, and your end users who will be using these AI options, from Q&A to personal agents/assistants, which are gaining steam in the industry.
Who cares.
I care more that one of my employees, whose agent is helping them learn (in our case), assisting them with an assignment, or presenting a response, will assume it is 100% accurate.
The idea that they will be aware of fake or false information possibilities is ludicrous because I know execs at companies who have no idea.
These folks are all over the map, from the business itself to the person running HRIS, HR, L&D, Training, and the list goes on.
"We are not worried because it is our content, not from the web," you say.
Here’s a secret – it doesn’t matter.
Hallucinations exist.
It’s a flaw in AI, just like bias.
I did my own research by placing my content into an LLM and posing questions against it to see how well it extracts the information, again, only with my content.
I found a hodgepodge – some were correct, some were not.
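If you want to run the same kind of spot check against your own content, here is a rough sketch assuming the OpenAI Python SDK; the model name, the file "my_course_content.txt", and the questions are placeholders I made up, and any vendor's API would work along the same lines. The point is that a person still has to compare every answer to the source.

```python
# A rough sketch of the spot check described above, assuming the OpenAI
# Python SDK (any LLM API would work similarly). The model name, file name,
# and questions are placeholders, not recommendations.
from openai import OpenAI

client = OpenAI()   # assumes OPENAI_API_KEY is set in your environment

source_text = open("my_course_content.txt", encoding="utf-8").read()
questions = [
    "What does the content say the refund policy is?",
    "Which modules are listed under onboarding?",
]

for q in questions:
    response = client.chat.completions.create(
        model="gpt-4o-mini",   # swap in whichever model your vendor uses
        messages=[
            {"role": "system",
             "content": "Answer using ONLY the provided content. "
                        "If the answer is not in the content, say you do not know."},
            {"role": "user", "content": f"Content:\n{source_text}\n\nQuestion: {q}"},
        ],
    )
    # Each answer still has to be checked by a person against the source;
    # this is where the hodgepodge of right and wrong answers shows up.
    print(q, "->", response.choices[0].message.content)
```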
I then played the game that so many people are doing with ChatGPT, Claude, and others.
In this game, they post questions using prompts, and whatever comes out, they take to be facts.
Lawyers continue to do this and then find out it’s wrong.
Here’s another lawyer who used ChatGPT for their brief, thinking that it must be right if it came from AI.
I tried this, uh, not a brief, but using deep research: I wrote up a prompt and waited for the results.
The sources were provided.
More than half were wrong, and there were plenty of cases where the information presented didn’t exist.
Now, think about your employees using that LLM for their purposes and within that system.
Strengths and Weaknesses
Does your salesperson tell you that the LLM they use (if they know, and many do not) has strengths and weaknesses?
Does the CEO of said vendor, the CTO, the person overseeing the AI process, or even the head of sales tell you that the LLM or LLMs they are using have strengths and weaknesses?
Or that using that LLM or those LLMs, even with your data or your content, still carries those S&Ws?
What?
They didn’t.
Why is that?
My guess is that they have no idea, and if they do, why share?
Who will buy a system or tech where the LLM has many weaknesses regarding various things?
A vendor does not have to provide, and I have yet to find one that does, a comparison between the model they are using and, say, another model, or a benchmark study they found (and when they do this at all, it will show how great theirs is compared to other models).
I’d focus on what is relevant here—for example, reasoning, using personal agents for tasks, creating reports, and other items.
If you buy a system, tech, or whatever that has AI in it, you should want to know its S&Ws because, beyond token fees, they will impact you at some point.
The hallucination piece is enormous.
Which is why I bring it up.
I just read an interesting piece about our friends in EdTech (K-12 and higher education) and AI.
They are finding out the other side of AI.
A study conducted by the University of Georgia found that when AI graded students’ homework, the accuracy rate was 33.5%.
When they added a human-created rubric to the LLM, accuracy went up to over 50%.
This shows that before a school or university just has AI do the grading instead of the professor, teacher, or their TA (I am talking to so many professors here), they should go back to using their TAs.
Ditto.
Sorry, teachers.
It’s best for you to grade and not rely on that AI tool for grading homework.
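For anyone curious what "adding a human-created rubric" looks like in practice, here is a rough sketch of rubric-guided grading, again assuming the OpenAI Python SDK; the rubric text, model name, and essay file are placeholders I made up. As the Georgia numbers show, a teacher or TA still needs to review whatever comes back.

```python
# A rough sketch of rubric-guided grading, assuming the OpenAI Python SDK;
# the rubric text, model name, and sample essay file are placeholders.
from openai import OpenAI

client = OpenAI()

# The human-created rubric is what lifted accuracy in the Georgia study;
# without it, the model grades against its own hidden (and shakier) criteria.
rubric = """Score each item from 0-5:
1. Thesis is clear and answers the prompt.
2. Claims are supported with evidence from the reading.
3. Organization: intro, body paragraphs, conclusion.
4. Grammar and mechanics."""

essay = open("student_essay.txt", encoding="utf-8").read()

response = client.chat.completions.create(
    model="gpt-4o-mini",   # placeholder model
    messages=[
        {"role": "system",
         "content": "You are grading student homework. Apply ONLY the rubric "
                    "provided, cite the rubric item for every point awarded, "
                    "and flag anything you are unsure about for human review."},
        {"role": "user", "content": f"Rubric:\n{rubric}\n\nEssay:\n{essay}"},
    ],
)
print(response.choices[0].message.content)   # a teacher or TA still reviews this
```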
Wait, there’s more!
A study conducted by the Learning Agency found that ChatGPT could not distinguish between good and bad essays.
Worse, the study found racial bias.
Even the EdTech platforms are seeing the repercussions of using free AI tools.
Considering the ChatGPT issues noted above, you would think Chegg could offset the customers fleeing, err, the customers it isn’t getting.
Chegg, you see, will be laying off 22% of its workforce due to those free AI tools, with ChatGPT being the one students use most frequently out of all the free AI LLMs.
Our pals at Duolingo created many courses using AI, but they failed to mention, probably to some of their customers, that they (Duolingo) plan to lay off their contractors and replace them with AI.
What that tidbit may fail to mention is that they started this process in 2023, when they cut 10% of their contractors and replaced them with AI.
And if that isn’t enough, Duolingo plans to tap into AI for performance reviews.
Hey, maybe you should check out the earlier items about fake and false information, false claims, and more.
Bottom Line
When people say AI, they refer to Gen AI, not machine learning.
A key distinction.
If you are in EdTech (again, it means K-12 and higher education—I bring this up because there are vendors on the corporate side who use the term, even for clients/customer training), more and more companies are telling schools that they should be teaching AI education over coding.
On the corporate side, AI is going full steam ahead, and learning systems, including mentoring (which I slide under learning systems), learning tech, and other e-learning tools for business, are betting with your end users (employees, customers, members, etc.) on whether or not they can trust what is being outputted.
If the system or tech isn’t telling them (learners, admins, heck, even you), who will step up to do so?
Because if it isn’t you?
Then who?
The principal, the executive overseeing the entire online learning program,
or
Perhaps, our dear friend,
AI.
Because you know you can always trust its
Accuracy.
E-Learning 24/7