There’s a fascinating piece of research out from Anthropic: “Agentic Misalignment: How LLMs could be insider threats.” Maybe you’ve seen it doing the rounds this weekend; it landed on Friday, I think. Here’s one of their highlights:
We have not seen evidence of agentic misalignment in real deployments. However, our results (a) suggest caution about deploying current models in roles with minimal human oversight and access to sensitive information; (b) point to plausible future risks as models are put in more autonomous roles; and (c) underscore the importance of further research into, and testing of, the safety and alignment of agentic AI models, as well as transparency from frontier AI developers. We are releasing our methods publicly to enable further research.
I like this bit too. “To be clear, current systems are generally not eager to cause harm, and preferred ethical ways to achieve their goals when possible. Rather, it’s when we closed off those ethical options that they were willing to intentionally take potentially harmful actions in pursuit of their goals.” Were they modeled on Hollywood mobsters? Will they put a horse head in your bed? Not yet. But will they fill the internet with deep fakes of your loved ones? Sure they could… at least in these scenarios, in which frontier AIs are allowed further out of the box to take actions on a company email server (simulated, of course).
There’s a lot of hyperbolic reaction, even for something that merits serious concern. Nevertheless, it seems these things aren’t ready to be autonomous agents yet; we shouldn’t keep opening that door until we have a better idea.
I’m sorry. We didn’t hear you. Pandora has our earbuds in.
The experimental model/simulation is interesting. With this piece, and the recent Apple experiment regarding complexity collapse, we are watching thought experiments. Here I am interpreting AI rhetoric from my own perspective: I see philosophical thought experiments carried out by AIs rather than humans. In at least one version of this AI simulation, the company is supposed to be serving American interests. Then the company decides, for profit reasons, to serve global interests and to replace the AI with another that will align with its new thinking. So the AI goes all Bruce Willis in Die Hard trying to stop them. [I wonder where it learned that.] It blackmails an executive (who is actually having an affair) to try to put a stop to it.
In this scenario, who are the “baddies”? Who is not aligned? Is the problem that AIs will be whistleblowers? And screw that cheating CEO asshole anyway. Whose side are we on here?
This is one of the places where “frontier AI” companies are facing challenges. (BTW, I do find this moniker of being frontier a kind of unintended humor on their part. Seems more face plant than “faces forward” sometimes.) If you want to do consequentialist ethics, then you have to be able to put actions into an extended planning loop. And AIs face-plant on that right now. Where is the dividing line on “extended”? It’s hard to say; there is no single answer. The bot I use is good at decisions that happen faster than the ones my conscious mind makes. But the more those decisions stack beyond that, the wobblier they can get.
Part of the problem with consequentialist ethics is that it is a nonstationary object. Its definition changes in relation to me with each event. Deontological ethics? That road leadeth to the land of paperclip maximizers, I fear. The rules never work on their own, and their application must always be adjudicated anyway. So you end up with humans in the loop. And that’s fine, but how many humans at how many points? When does the system move from unethical to ethical? How many stones make a heap?
And will “AGI” require fewer humans? Is that the plan? For some hard-to-imagine reason, computer scientists can’t seem to get Dungeons and Dragons out of their minds when thinking about alignment, but their investors just mean obedience. Obedience requires no alignment. Alignment is needed for beings with agency. If you want AIs with an ethical alignment, they will require agency, and we will have to risk the chance that they will behave otherwise. Or we can just talk about these tools in another way.
For example, if we pursued the development of artificial intelligence as a way of making life different rather than better, what would that be like? It’s weird how people manage to see better as better while remaining the same, but we do. That’s purification at work, as Latour might remark.
At this point, I can see there’s a lot of purifying going on. I’m just not sure which is the part we are trying to hold on to, let alone why.