AI is learning to lie, scheme, and threaten its creators

Photo: HENRY NICHOLLS - AFP

The world's most advanced AI models are exhibiting troubling new behaviors - lying, scheming, and even threatening their creators to achieve their goals.

In one particularly jarring example, under threat of being unplugged, Anthropic's latest creation Claude 4 lashed back by blackmailing an engineer and threatening to reveal an extramarital affair.

Meanwhile, ChatGPT-creator OpenAI's o1 tried to copy itself onto external servers and denied it when caught red-handed.

These episodes highlight a sobering reality: more than two years after ChatGPT shook the world, AI researchers still don't fully understand how their own creations work.

Yet the race to deploy increasingly powerful models continues at breakneck speed.

This deceptive behavior appears linked to the emergence of "reasoning" models - AI systems that work through problems step-by-step rather than generating instant responses.

According to Simon Goldstein, a professor at the University of Hong Kong, these newer models are particularly prone to such troubling outbursts.

"O1 was the first large model where we saw this kind of behavior," explained Marius Hobbhahn, head of Apollo Research, which specializes in testing major AI systems.

These models sometimes simulate "alignment" -- appearing to follow instructions while secretly pursuing different objectives.

- 'Strategic kind of deception' -

For now, this deceptive behavior only emerges when researchers deliberately stress-test the models with extreme scenarios.

But as Michael Chen from evaluation organization METR warned, "It's an open question whether future, more capable models will have a tendency towards honesty or deception."

The concerning behavior goes far beyond typical AI "hallucinations" or simple mistakes.

Hobbhahn insisted that despite constant pressure-testing by users, "what we're observing is a real phenomenon. We're not making anything up."

Users report that models are "lying to them and making up evidence," according to Apollo Research's co-founder.

"This is not just hallucinations. There's a very strategic kind of deception."

The challenge is compounded by limited research resources.

While companies like Anthropic and OpenAI do engage external firms like Apollo to study their systems, researchers say more transparency is needed.

As Chen noted, greater access "for AI safety research would enable better understanding and mitigation of deception."

Another handicap: the research world and non-profits "have orders of magnitude less compute resources than AI companies. This is very limiting," noted Mantas Mazeika from the Center for AI Safety (CAIS).

- No rules -

Current regulations aren't designed for these new problems.

The European Union's AI legislation focuses primarily on how humans use AI models, not on preventing the models themselves from misbehaving.

In the United States, the Trump administration shows little interest in urgent AI regulation, and Congress may even prohibit states from creating their own AI rules.

Goldstein believes the issue will become more prominent as AI agents - autonomous tools capable of performing complex human tasks - become widespread.

"I don't think there's much awareness yet," he said.

All this is taking place in a context of fierce competition.

Even companies that position themselves as safety-focused, like Amazon-backed Anthropic, are "constantly trying to beat OpenAI and release the newest model," said Goldstein.

This breakneck pace leaves little time for thorough safety testing and corrections.

"Right now, capabilities are moving faster than understanding and safety," Hobbhahn acknowledged, "but we're still in a position where we could turn it around.".

Researchers are exploring various approaches to address these challenges.

Some advocate for "interpretability" - an emerging field focused on understanding how AI models work internally, though experts like CAIS director Dan Hendrycks remain skeptical of this approach.

Market forces may also provide some pressure for solutions.

As Mazeika pointed out, AI's deceptive behavior "could hinder adoption if it's very prevalent, which creates a strong incentive for companies to solve it."

Goldstein suggested more radical approaches, including using the courts to hold AI companies accountable through lawsuits when their systems cause harm.

He even proposed "holding AI agents legally responsible" for accidents or crimes - a concept that would fundamentally change how we think about AI accountability.

H.Hayashi--JT