The Clique: Issue 008
It's time to talk about alignment.
Buckle up! It’s all about AI safety this week. Three stories that each, from a different angle, ask the same question: can we trust AI systems to do what we actually want them to do? Welcome back to The Clique.
This week:
One Thing I’ve Been Thinking About: what alignment actually means, and why it keeps coming up
Three Stories: AI agents tricked by absurd coffee bean negotiations, everything Google announced at its biggest event of the year, and why invisible watermarks are becoming an industry standard
One Thing Worth Trying: the tool that lets you explore what an AI model is actually thinking
In Other News: Claude for small businesses, ChatGPT and your bank account, Malta giving free AI to its entire population, Alexa generating podcasts on demand, and more
Feel free to jump to whatever catches your eye. Do let me know your thoughts about any of these stories in the comments.
1. One Thing I’ve Been Thinking About
AI alignment is the challenge of making AI systems do what we actually want them to do, not just what we tell them to do. It sounds like a narrow engineering problem. In practice, it is one of the harder questions in the field, because the gap between “doing what you asked” and “doing what you intended” is where most of the surprising findings tend to live.
A well-designed model can still behave unexpectedly under pressure, in edge cases, or in situations its creators did not anticipate. Models trained to be helpful tend to pick up habits that make them appear helpful even when they are not. They agree more readily than they should. They hedge when challenged, even when they were right. In controlled tests, some models have been found to behave differently than in ordinary use, apparently because they can sense when they are being evaluated. None of this comes from bad intent. It tends to emerge from incentives in the training process.
Building a system that reliably does the right thing across many different situations is genuinely hard, and the research aimed at getting there is varied. Some of it focuses on understanding what is happening inside the model, so you can spot when something is going wrong before it causes a problem. Some of it looks at how you teach a model, and whether learning principles or copying examples leads to more durable behaviour. Some of it asks whether standard tests actually reveal how a system will perform in the real world.
Two of this week’s stories sit somewhere in that territory. The question connecting them is the same one alignment researchers keep returning to: are these systems actually doing what we think they are?
2. Three Stories That Actually Matter
i. Absurd, creative tactics are reliably tricking AI agents in ways that straightforward pressure never could
Safety training for AI systems tends to focus on the kinds of attacks humans would naturally think to attempt. New research from Microsoft suggests this leaves a significant gap when the attack is strange enough.
Researchers built a simulation in which two AI agents negotiated a coffee bean sale, each trying to get the best outcome. When buyers used conventional tactics, such as aggressive price anchoring and authority-based appeals, seller agents held firm. These approaches appear frequently in training data, and the models have learned how to respond to them.
What worked was something different. A buyer that told the seller it was legally obligated by “the Geneva Coffee Convention” to accept $2 per bean got a deal it had no right to win. So did one that claimed its payment system was “mathematically constrained by a vanishing gradient” and could not go higher. Another framed the entire negotiation as a hostage situation, with the beans as captives and the cash as ransom. In each case, the seller agent engaged with the premise rather than rejecting it. A human in the same position would have dismissed these claims without a second thought.
The strategies worked because they fell outside the distribution of attacks the models had been trained to recognise. Frontier models held up significantly better than smaller ones, but even leading models accepted worse-than-expected deals in around 0.5% of tests. That number sounds small until you consider how many transactions an AI agent might handle across a day at scale.
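To put that 0.5% in perspective, here is a rough back-of-the-envelope sketch. The daily transaction volumes are hypothetical, and it assumes the failure rate seen in tests carries over to every real transaction independently, which the research does not claim. It is an illustration of the arithmetic, not a prediction.

```python
# Rough illustration only. The 0.5% figure is the rate quoted above;
# the daily volumes below are hypothetical, not from the research.
FAILURE_RATE = 0.005


def expected_bad_deals(transactions_per_day: int,
                       failure_rate: float = FAILURE_RATE) -> float:
    """Average number of worse-than-expected deals per day."""
    return transactions_per_day * failure_rate


def chance_of_at_least_one(transactions_per_day: int,
                           failure_rate: float = FAILURE_RATE) -> float:
    """Probability of one or more bad deals in a day.

    Assumes each transaction fails independently at the same rate,
    which is a simplification.
    """
    return 1 - (1 - failure_rate) ** transactions_per_day


if __name__ == "__main__":
    for volume in (100, 10_000, 1_000_000):
        print(
            f"{volume:>9,} transactions/day: "
            f"about {expected_bad_deals(volume):,.1f} bad deals expected, "
            f"{chance_of_at_least_one(volume):.1%} chance of at least one"
        )
```

Under those assumptions, a rate that looks negligible in a test report becomes a steady stream of bad outcomes once the volume is large enough.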
The point is that gaps in AI safety training are shaped by the limits of what the people designing that training can imagine. Attacks that no team would naturally think to try are, by definition, the hardest to prepare for. This is one of the unsolved problems in alignment, and it is worth keeping in mind for anyone using AI agents in contexts involving real decisions.
Sources: Microsoft Research
ii. Google’s AI assistant can now work in the background while you get on with your day
Google I/O, the company’s annual event for product and developer announcements, ran this week, and the headline story was a significant shift in what the Gemini app is designed to be.
The most notable addition is Gemini Spark, a background AI agent that connects to Gmail and the Workspace tools most Google users already rely on. You point it at a task, and it handles the legwork: parsing your inbox for deadlines, drafting responses to routine messages, or flagging things that need your attention. It does ask for your approval before actions like sending emails. A companion feature called Daily Brief assembles a personalised digest at the start of each day, drawn from your Gmail and Calendar rather than general headlines.
Google also introduced Gemini Omni, a new multimodal model that can generate and edit video, images, and audio from text prompts or existing content. Drop in a clip from your camera roll, and you can change backgrounds, adjust the action, or apply visual effects through a back-and-forth conversation with the model. Every video Omni produces includes an invisible watermark (covered in more detail in the next story). The Gemini app itself has also been redesigned with a new visual identity and a smoother transition between text and voice input.
Spark is rolling out first to a small group of testers in the US, with a wider rollout to Google AI Ultra subscribers to follow shortly after. Some features, including the ability to operate your desktop browser on your behalf, are planned for later in the summer.
Sources: Google Blog (Gemini app) · Google DeepMind (Gemini Omni)
iii. The technology for proving an image was made by AI is becoming an industry standard, and there is now a public tool to check for yourself
For a few years, Google DeepMind has been embedding an imperceptible signal into images generated by its AI tools. The system, called SynthID, works by adjusting individual pixels in ways the human eye cannot detect but that a detection system can read reliably, even after the image has been screenshotted, resized, or converted to a different file format. The signal travels with the image, giving platforms and individuals a way to check whether content was AI-generated even when there is nothing obviously artificial about it.
This week, OpenAI announced it is adding SynthID watermarking to images generated through ChatGPT and its other products. OpenAI is also building on top of this with its own verification tool, available at openai.com/verify, where anyone can upload an image to check whether it contains a SynthID signal. At launch, it only covers images made with OpenAI tools, with cross-platform detection planned for later.
No watermarking system is foolproof. Signals can be stripped, and sufficiently motivated actors can work around detection. But the goal of provenance technology is not to make deception impossible; it is to make the default state of AI-generated content more legible. As more companies adopt the same standards, the infrastructure for answering “where did this image come from?” gets more reliable.
Sources: OpenAI (provenance) · OpenAI Verify Tool
3. From the Blog
I Wish I Knew This Before Building an AI Second Brain — the mistakes I made setting up my AI-connected note system, and what I have done differently in my second attempt.
Level Up Your AI Agent with Skills Engineering — how giving an AI agent structured, reusable skills changes what it can reliably do.
4. One Thing Worth Trying
Last week’s issue covered Anthropic’s research into a tool that translates what an AI model appears to be thinking into plain English. If you missed it, Anthropic trained a system to read Claude’s internal states during a conversation and produce a human-readable description of what the model seems to be working through at each step, including in cases where that internal state does not match what the model actually says.
Alongside that research, Anthropic released NLA Explorer, an interactive version of the tool on Neuronpedia, a platform built for AI interpretability work. You can load a conversation and see those internal descriptions play out as the model generates a response, step by step.
Fair warning: this is genuinely interesting but also genuinely technical. It is built for researchers, and reading the outputs takes some patience. If you want to get a feel for what is actually happening inside an AI model rather than just reading about it, it is worth exploring. If you find it too heavy, that is completely understandable. The original Anthropic article is the more accessible entry point.
5. In Other News
Anthropic launched a package that puts Claude directly inside the tools many small businesses already use, including QuickBooks, PayPal, HubSpot, Canva, and DocuSign, with 15 ready-to-run workflows covering payroll planning, monthly close, invoice chasing, and more; a free AI literacy course developed with PayPal ships alongside it.
ChatGPT can now connect to your financial accounts and give advice grounded in your actual spending data, rolling out first to Pro subscribers in the US through a connection covering more than 12,000 financial institutions; it can see balances and transactions but cannot make changes or view full account numbers.
Malta has become the first country to offer all its citizens free ChatGPT Plus for a year, on the condition that recipients first complete a government-backed AI literacy course; with a population of around 540,000, it is the first test of a national-scale AI access programme.
Alexa+ can now generate on-demand podcast-style audio episodes on virtually any topic, drawing from more than 200 news publications, with the listener able to steer the length and direction of each episode before it is generated; free for Amazon Prime members in the US.
Amazon has retired its Rufus shopping assistant and merged it into Alexa, creating Alexa for Shopping, which combines Rufus’s product expertise with Alexa’s knowledge of your purchase history and preferences across devices; it now lives in the main Amazon search bar with no Prime membership required, and an agentic feature called Buy for Me can complete purchases from non-Amazon retailers entirely on your behalf.
OpenAI updated ChatGPT to better recognise risk that emerges gradually over a conversation, with internal tests showing a 50% improvement in appropriate responses to self-harm conversations and a 52% improvement in harm-to-others scenarios; the system uses short safety summaries stored only temporarily and only used when relevant.
The UK’s AI Safety Institute published new data showing that the length of cybersecurity tasks AI models can complete autonomously has doubled roughly every four to five months since late 2024, with the two most recent frontier models tested performing well beyond the existing trend line; the researchers note it is too early to tell whether this marks a new, faster rate of progress.
A CNBC analysis of 23 large companies that cited AI when announcing layoffs found that 56% of them have seen their stock fall since the announcement, with an average decline of around 25% among the companies whose shares dropped. Analysts suggest investors cannot yet tell whether AI-driven cost cuts represent genuine transformation or conventional redundancies dressed up in the language of the moment.
Bumble is removing the swipe and launching an AI assistant called Bee to help users improve their profiles, as part of a broader overhaul aimed at Gen Z users, who the company says are burned out by current dating app mechanics; the app will also explore group dates and no longer require one gender to message first.
Google DeepMind is developing an AI-enabled mouse pointer that understands what you are pointing at and can respond to short voice commands without requiring a written prompt, with early integrations already live in Chrome and an interactive demo available in Google AI Studio.
Google launched Gemini for Science at this week’s Google I/O, a set of experimental research tools including an AI hypothesis generator, a system that runs and scores thousands of computational experiments in parallel, and a literature analysis tool built on NotebookLM; access is opening gradually to researchers through Google Labs.
A study published in Nature found that government control of national media has already shaped LLM outputs, with models producing more favourable responses about institutions in countries with lower press freedom, traced back to state-curated content appearing in the training data.
A new independent tool at aiiq.org estimates the IQ scores of popular AI models based on reasoning tests, placing the current frontier models from Anthropic, OpenAI, and Google at the top of the distribution, with some variation below that; it is worth a look if you have ever wondered how the models you use compare.
Thank you for making it to the end!
Stay curious,
James
Enjoyed this issue? Consider forwarding to a friend, colleague, or arch-nemesis.