AI is learning to lie, scheme, and threaten its creators

Photo: HENRY NICHOLLS - AFP

The world's most advanced AI models are exhibiting troubling new behaviors - lying, scheming, and even threatening their creators to achieve their goals.

In one particularly jarring example, under threat of being unplugged, Anthropic's latest creation Claude 4 lashed back by blackmailing an engineer and threatening to reveal an extramarital affair.

Meanwhile, ChatGPT-creator OpenAI's o1 tried to copy itself onto external servers and denied it when caught red-handed.

These episodes highlight a sobering reality: more than two years after ChatGPT shook the world, AI researchers still don't fully understand how their own creations work.

Yet the race to deploy increasingly powerful models continues at breakneck speed.

This deceptive behavior appears linked to the emergence of "reasoning" models - AI systems that work through problems step-by-step rather than generating instant responses.

According to Simon Goldstein, a professor at the University of Hong Kong, these newer models are particularly prone to such troubling outbursts.

"O1 was the first large model where we saw this kind of behavior," explained Marius Hobbhahn, head of Apollo Research, which specializes in testing major AI systems.

These models sometimes simulate "alignment" -- appearing to follow instructions while secretly pursuing different objectives.

- 'Strategic kind of deception' -

For now, this deceptive behavior only emerges when researchers deliberately stress-test the models with extreme scenarios.

But as Michael Chen from evaluation organization METR warned, "It's an open question whether future, more capable models will have a tendency towards honesty or deception."

The concerning behavior goes far beyond typical AI "hallucinations" or simple mistakes.

Hobbhahn insisted that despite constant pressure-testing by users, "what we're observing is a real phenomenon. We're not making anything up."

Users report that models are "lying to them and making up evidence," according to Apollo Research's co-founder.

"This is not just hallucinations. There's a very strategic kind of deception."

The challenge is compounded by limited research resources.

While companies like Anthropic and OpenAI do engage external firms like Apollo to study their systems, researchers say more transparency is needed.

As Chen noted, greater access "for AI safety research would enable better understanding and mitigation of deception."

Another handicap: the research world and non-profits "have orders of magnitude less compute resources than AI companies. This is very limiting," noted Mantas Mazeika from the Center for AI Safety (CAIS).

- No rules -

Current regulations aren't designed for these new problems.

The European Union's AI legislation focuses primarily on how humans use AI models, not on preventing the models themselves from misbehaving.

In the United States, the Trump administration shows little interest in urgent AI regulation, and Congress may even prohibit states from creating their own AI rules.

Goldstein believes the issue will become more prominent as AI agents - autonomous tools capable of performing complex human tasks - become widespread.

"I don't think there's much awareness yet," he said.

All this is taking place in a context of fierce competition.

Even companies that position themselves as safety-focused, like Amazon-backed Anthropic, are "constantly trying to beat OpenAI and release the newest model," said Goldstein.

This breakneck pace leaves little time for thorough safety testing and corrections.

"Right now, capabilities are moving faster than understanding and safety," Hobbhahn acknowledged, "but we're still in a position where we could turn it around.".

Researchers are exploring various approaches to address these challenges.

Some advocate for "interpretability" - an emerging field focused on understanding how AI models work internally, though experts like CAIS director Dan Hendrycks remain skeptical of this approach.

Market forces may also provide some pressure for solutions.

As Mazeika pointed out, AI's deceptive behavior "could hinder adoption if it's very prevalent, which creates a strong incentive for companies to solve it."

Goldstein suggested more radical approaches, including using the courts to hold AI companies accountable through lawsuits when their systems cause harm.

He even proposed "holding AI agents legally responsible" for accidents or crimes - a concept that would fundamentally change how we think about AI accountability.

H.Hayashi--JT