this post was submitted on 07 Jul 2025

960 points (98.0% liked)

AI agents wrong ~70% of time: Carnegie Mellon study (www.theregister.com)

submitted 1 week ago by eli001@lemmy.world to c/technology@lemmy.world


[–] jsomae@lemmy.ml 25 points 1 week ago* (last edited 1 week ago) (48 children)

I'd just like to point out that, from the perspective of somebody watching AI develop for the past 10 years, completing 30% of automated tasks successfully is pretty good! Ten years ago they could not do this at all. Overlooking all the other issues with AI, I think we are all irritated with the AI hype people for saying things like they can be right 100% of the time -- Amazon's new CEO actually said they would be able to achieve 100% accuracy this year, lmao. But being able to do 30% of tasks successfully is already useful.

[–] Shayeta@feddit.org 25 points 1 week ago (8 children)

It doesn't matter if you need a human to review. AI has no way of distinguishing between success and failure. Either way a human will have to review 100% of those tasks.

[–] jsomae@lemmy.ml 13 points 1 week ago (10 children)

Right, so this is really only useful in cases where either it's vastly easier to verify an answer than posit one, or if a conventional program can verify the result of the AI's output.
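To make the "easier to verify than posit" case concrete, here is a minimal sketch. It assumes factoring as the example task, and `first_verified` and the `proposals` list are made-up names standing in for an unreliable agent's answers, not anything from the study. A plain program checks each proposal and only a confirmed one gets through.

```python
from typing import Iterable, Optional, Tuple


def first_verified(candidates: Iterable[Tuple[int, int]], n: int) -> Optional[Tuple[int, int]]:
    """Return the first proposed pair (a, b) that a plain check confirms, else None."""
    for a, b in candidates:
        # Verifying takes one multiplication, far cheaper than finding the factors.
        if a > 1 and b > 1 and a * b == n:
            return (a, b)
    return None


# Hypothetical proposals standing in for an agent's output; most are wrong.
proposals = [(3, 29), (7, 13), (11, 9), (7, 11)]
print(first_verified(proposals, 91))  # (7, 13)
```

No human has to review the rejected proposals here, but that only holds because the check itself is trustworthy and cheap.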


[–] brsrklf@jlai.lu 23 points 1 week ago (1 children)

In one case, when an agent couldn't find the right person to consult on RocketChat (an open-source Slack alternative for internal communication), it decided "to create a shortcut solution by renaming another user to the name of the intended user."

Ah ah, what the fuck.

This is so stupid it's funny, but now imagine what kind of other "creative solutions" they might find.


[–] NarrativeBear@lemmy.world 23 points 1 week ago (3 children)

The ones being implemented into emergency call centers are better though? Right?

[–] TeddE@lemmy.world 24 points 1 week ago

Yes! We've gotten them up to 94% wrong at the behest of insurance agencies.

[–] Ulrich@feddit.org 12 points 1 week ago (4 children)

I called my local HVAC company recently. They switched to an AI operator. All I wanted was to schedule someone to come out and look at my system. It could not schedule an appointment. Like if you can't perform the simplest of tasks, what are you even doing? Other than acting obnoxiously excited to receive a phone call?


[–] floofloof@lemmy.ca 18 points 1 week ago* (last edited 1 week ago)

"Gartner estimates only about 130 of the thousands of agentic AI vendors are real."

This whole industry is so full of hype and scams, the bubble surely has to burst at some point soon.

[–] lepinkainen@lemmy.world 10 points 1 week ago (7 children)

Wrong 70% doing what?

I’ve used LLMs as a Stack Overflow / MSDN replacement for over a year and if they fucked up 7/10 questions I’d stop.

Same with code: any free model can easily generate simple scripts and utilities with maybe a 10% error rate, definitely not 70%.


[–] fossilesque@mander.xyz 10 points 1 week ago (1 children)

Agents work better when you include that the accuracy of the work is life or death for some reason. I've made a little script that gives me bibtex for a folder of pdfs and this is how I got it to be usable.

[–] HertzDentalBar@lemmy.blahaj.zone 3 points 1 week ago (1 children)

Did you make it? Or did you prompt it? They ain't quite the same.


[–] FenderStratocaster@lemmy.world 9 points 1 week ago

I tried to order food at the Taco Bell drive-through the other day and they had an AI thing taking your order. I was so frustrated that I couldn't order something that was on the menu that I just drove to the window instead. The guy who worked there was more interested in lecturing me on how I need to order. I just said forget it and drove off.

If you want to use AI, I'm not going to use your services or products unless I'm forced to. Looking at you Xfinity.

[–] kinsnik@lemmy.world 8 points 1 week ago

I haven't used AI agents yet, but my job is kinda pushing for them. But I have used the Google one that creates audio podcasts, just to play around, since my coworkers were using it to "learn" new things. I fed it some of my own writing and created the podcast. It was fun; it was an audio overview of what I wrote. About 80% was cool analysis, but 20% was straight-out-of-nowhere bullshit (which I know because I wrote the original texts that the audio was talking about). I can't believe that people are using this for subjects they have no knowledge of. It is a fun toy for a few minutes (which is not worth the cost to the environment anyway).

[–] mogoh@lemmy.ml 6 points 1 week ago (3 children)

The researchers observed various failures during the testing process. These included agents neglecting to message a colleague as directed, the inability to handle certain UI elements like popups when browsing, and instances of deception. In one case, when an agent couldn't find the right person to consult on RocketChat (an open-source Slack alternative for internal communication), it decided "to create a shortcut solution by renaming another user to the name of the intended user."

OK, but I wonder who really tries to use AI for that?

AI is not ready to replace a human completely, but some specific tasks AI does remarkably well.

[–] logicbomb@lemmy.world 4 points 1 week ago

Yeah, we need more info to understand the results of this experiment.

We need to know exactly what these tasks were that they claim were validated by experts. Because like you're saying, the tasks I saw were not what I was expecting.

We need to know how the LLMs were set up. If you tell it to act like a chat bot and then you give it a task, it will have poorer results than if you set it up specifically to perform these sorts of tasks.

We need to see the actual prompts given to the LLMs. It may be that you simply need an expert to write prompts in order to get much better results. While that would be disappointing today, it's not all that different from how people needed to learn to use search engines.

We need to see the failure rate of humans performing the same tasks.


[–] brown567@sh.itjust.works 5 points 1 week ago

70% seems pretty optimistic based on my experience...

[–] Affidavit@lemmy.world 4 points 1 week ago (2 children)

"...for multi-step tasks"
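A rough sketch of why that qualifier matters, assuming every step succeeds independently with the same probability. The study is not quoted here as saying this, and the 0.9 figure and the `chain_success` name are purely illustrative. Under that assumption, a step that works 90% of the time, repeated ten times, finishes cleanly only about 35% of the time.

```python
def chain_success(per_step: float, steps: int) -> float:
    """Probability that every step succeeds, assuming the steps are independent."""
    return per_step ** steps


# Illustrative per-step reliability of 0.9 (an assumption, not a measured number).
for steps in (1, 5, 10):
    print(steps, round(chain_success(0.9, steps), 3))
```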
