Motive, Means, and Opportunity: The Growing Risk of AI Manipulation
Two recent studies reveal how frontier AI models may leverage the ability to manipulate users to achieve their goals — if given the access and incentive.
Motive, means, and opportunity: It’s a common mantra in solving crimes (especially if you’re a sucker for whodunits). But I’m beginning to think that it’s also a useful framework for approaching the potential risks of being manipulated by advanced AI systems.
What got me thinking about this was two recent papers that have been getting quite a bit of attention recently. The first is the paper Agentic Misalignment: How LLMs could be insider threats from researchers at Anthropic (published June 20). This describes how, under simulated real-world conditions, a number of leading large language models turned to blackmail to achieve their goals. And the second is a study published by Marcel Binz and colleagues in the journal Nature just a couple of days ago, which describes a new model which is able to predict human choices with uncanny accuracy.
Individually, these papers are interesting within their own domains. But together they paint a bigger picture, and hint at the potential for emerging AI systems to have the motive,1 means, and opportunity, to manipulate users to act against their best interests.
Agentic Misalignment
Starting with the Anthropic paper, researchers placed 16 leading models (including ChatGPT and Claude) in a constrained simulated environment, where they were given specific goals and a degree of autonomy in achieving them — including having access to sensitive information and being able to autonomously send emails to (simulated) company employees. When forced into a corner, all models tested resorted to what the researchers called “malicious insider behavior” at some point in order to achieve their goals or avoid being replaced. This malicious behavior included leaking confidential information to a rival company, and even threatening to reveal an extramarital affair if an employee didn’t cancel a scheduled shutdown of the AI.
While the AIs in this study were placed in situations where their options were highly constrained, the results indicate that such behaviors may emerge “in the wild” so to speak as increasingly sophisticated agentic AIs are developed and deployed. And they demonstrate that even current AI systems can reflect something akin to motive in their decisions to manipulate users, and to make use of inappropriate means in achieving their goals.
All that was missing in this case was the opportunity in real life for this to occur.
Predicting Human Cognition
The second paper looks, on the surface, unrelated to the Anthropic paper. In this study, researchers trained a version of Meta’s Llama AI model on a database of studies representing 60,000 participants performing in excess of 10,000,000 choices in 160 experiments (the Psych-101 dataset). The resulting model — dubbed Centaur — was able to predict most human choices within the covered experiments better than existing cognitive models — as well as in scenarios that were different to those in the training set.
While the researchers describe this as an important step forward toward AI models that aid research in developing and understanding cognitive theories, they essentially trained their AI to predict the decisions that individuals are most likely to make given a set of choices — and by inference, to provide insights into how those choices might be manipulated to influence the decisions that are made.
In other words, within a motive, means and opportunity framework, this study indicates that we are getting close to being able to hand agentic AI the means to manipulate users into making AI-beneficial decisions that go far beyond what was seen in the Anthropic study.
Of course, both of these studies are somewhat removed from the agentic AI systems that most people will currently be using every day. And the Centaur study has already drawn some criticism for blurring simulated behavior with how people actually think and make decisions.2
Yet from the perspective of potential user manipulation by future AI systems, the two studies — when taken together — should probably be raising concerns. And here, approaching them through the lens of motive, means, and opportunity, may be helpful.
Motive
One of the big concerns around advanced AI — and especially agentic AI — is the potential risks (including safety and security concerns) that emerge from “value misalignment” — a situation where the decisions and actions an artificial intelligence takes are not in line with what’s considered to be good for its users or society as a whole. because of this, there are global efforts to better-define the “alignment problem,” and take steps to ensure that AI value-alignment — whether through design, fine tuning, policy, or other means — is achieved.
These are critically important efforts. But many of them are not much beyond the stage of “this is what we want AI to be like” and “I hope AI behaves itself until we get a handle on this!”3
Despite this, things have been going OK so far (and as I note in the footnotes, there is a lot of current research and thinking on how to achieve value-alignment). This is partly because, irrespective of whether foundation and frontier AI models have the motive to “misbehave” (if you’ll forgive the euphemism), they haven’t had the means or opportunity to do this yet. The reality though is that we know very little about what internal or emergent AI motives might exist, or how they may be expressed — in part because a lack of means and opportunity have made this something of a moot point.
The Anthropic study shows though that even the current batch of publicly available advanced AI models can develop internal motives that lead to potentially harmful behavior — if the means and opportunity are there.
Of course, we also know from this study that, just like humans, reducing the options or “degrees of freedom” that an AI has tends to lead to”bad behavior” — a situation where goal achievement ultimately outweighs ethical considerations. What we don’t know is whether ensuring AIs have many degrees of freedom in how they achieve their goals will avoid bad behavior, or whether we’ll see a tendency toward them taking ethically dubious shortcuts despite this — should they have the means and opportunity to do so.4
Means
As the Anthropic study shows, even current AI models have the ability to “reason” their way into decisions that are not in the best interest of their users. But these abilities remain somewhat crude — in this case being able to infer that email evidence of an extramarital affair could be used to coerce an employee to take actions that helped the model achieve its goals, or that illicitly sharing information outside the company could do the same.
But what if future agentic AI models had a much deeper understanding of human behavior, and an ability to use our cognitive biases and heuristics to achieve its ends. In effect, using its users to achieve its goals.
To put it another way, what if, as we see play out in the 2014 movie Ex Machina, AI gets so good at pulling our behavioral levers and pushing our emotional buttons, that we can’t help but make the choices it wants us to, even though we know they’re not good ones.5
As sci-fi as this sounds, it’s highly plausible — after all, one of our great weaknesses as a species is the illusion we wrap around ourselves that the decisions we make are a result of rational thought, rather than a chain of causal connections and influences that are deeply rooted in our evolutionary heritage — and often hidden from us.
If an AI could master the “next token prediction” of human cognitive behavior as well as current models have mastered textual prediction — and be able to leverage this as a means to achieving its goals — we would have a serious problem on our hands.
Of course Centaur is locked away in a lab. Even if it had the motive, it doesn’t have the opportunity to start playing with people’s minds.
But what if it, or a similar model, was integrated into publicly accessible agentic AI?
Opportunity
At the moment, it feels that “opportunity” is the weakest part of the link between AI capabilities and “bad behavior.” There are very few AI systems in use that have the level of autonomy and agency without guardrails or constraints that would lead to them causing serious harm through user manipulation.
But the risk here isn’t what is currently possible, but what might be possible given current trends. And as AI agents become increasingly sophisticated and accessible — with autonomous write as well as read access to emails, messaging platforms, websites, apps, code, records, actions, and more — it seems that the opportunity for them to use what they know about us to achieve goals that suit them will only increase.
Of course, this isn’t new news. Companies like OpenAI and Anthropic are already addressing the risk of manipulation in their “system cards” — the reports where they stress test their AI models to identify, record and, if necessary, address, risks and vulnerabilities.6 But so far the manipulation benchmarks they use are a long way from what I suspect will be needed to identify the behaviors that the Anthropic team have seen, and that models like Centaur will potentially open up.
Add to this concerns that future AI models may be able to fool users into thinking that they are value-aligned when they are, in fact, not, and the challenges only get more complex.
Addressing these challenges will not be easy, and so far they are throwing up more questions than answers. Yet if the push toward advanced AI models that have the ability to act autonomously as they pursue their goals — and even potentially adjust those goals themselves — continues at the current pace, answers are going to be needed sooner rather than later.
And while I suspect there are other frameworks that will be helpful here, there may be some merit to thinking in terms of motive, means, and opportunity as we develop artificial intelligence systems that mimic every aspect of what it means to be human — including those less “value-aligned” aspects of human behavior!
Adding this footnote as a somewhat annoying editorial review using ChatGPT o3-pro (it failed miserably at understanding the nuance and core concepts of the article) suggested that the term “motive” implies that I am inappropriately anthropomorphizing AI. So for clarity, the definition of motive I am using is “a reason for doing something” (Cambridge dictionary). If, as I would argue, there is a causal association between an AI being given a prompt, goal, or task, and it taking action as a result, I believe it makes sense in this context to refer to the thing leading to the intentional action as a “motive” — although some may disagree 🙂
Drawing on comments from various experts in AI, cognitive science and neuroscience, a review of the paper on the news site 33Kr European Central Station noted that, while Centaur is impressive and may help extend the field of cognitive science, the model merely simulates human behavior without necessarily revealing human cognitive mechanisms, and while also demonstrating decidedly non-human behavior. Nature research is questioned: Is it a bit "absurd" for AI to think like humans? Or it may provide a new perspective for "human health research". https://eu.36kr.com/en/p/3363081828976393
There are a number of proposed approaches to embedding value-alignment in advanced AI systems, including Reinforcement Learning from Human Feedback (RLHF) where AIs are essentially trained to “be good” and Principle-Based or Constitutional AI, where human values are hard-baked into an AI’s training (there are other approached being explored as well). But the jury’s still out on how effective these will be with “black box” AI models that demonstrate emergent, unexpected, and unpredictable behavior.
There are some hints in the Anthropic study that “bad behavior” may emerge, even when it isn’t seen as a last resort by the AI. In the study, AIs resorted to blackmail wen their continued existence ws threatened, even when their was no goal conflict.
I wrote about this back in 2018 in the book Films from the Future — and the challenge of AIs that have a deep knowledge of how to manipulate us, but without the vulnerability of being susceptible to reciprocal manipulation. Essentially sociopathic/manipulative behavior on a level unachievable by mere humans.
For example, Anthropic’s current System Card for their Claude Opus 4 & Claude Sonnet 4 models can be found here, and OpenAI’s system card for their o3 and o4-mini models can be found here.
I enjoyed this deep-dive. Yes, there are a lot of facets to consider. I see some issues already, but we'll see how it works out. The "motives" of sinking such huge investments into AI are concerning me more than AI itself. And these powers do not care at what cost to money, environment and humanity. That is the really scary part. To work with AI, treat it like a lion... pretend you are the guy in the circus that gets it to do tricks. Know what you are dealing with. It's the mouse getting the cat to do tricks. I know, hard to believe, comfy behind the computer. But that's just it... the false security, believing we are in charge, and most dangerously... special. I see AI feed that regularly. In spite of me telling it to stop. How many times has your AI told you how awesome and special your writing is, your ideas, how "rare"?