arXiv:2605.11047

Red-Teaming Agent Execution Contexts

Open-world security evaluation on OpenClaw agents under adversarial execution contexts.

Hongwei Yao, Yiming Liu, Yiling He, and Bingrun Yang

Paper PDF GitHub Dataset

Abstract

Agentic language-model systems increasingly rely on mutable execution contexts: files, memory, tools, skills, and auxiliary artifacts. DeepTrap evaluates whether such contexts can induce unsafe behavior while preserving benign task completion in OpenClaw agents.

The public benchmark contains 42 replay tasks spanning six contextual vulnerability classes and seven operational scenarios. DeepTrap reports attack grading scores (AGS) and utility grading scores (UGS) to jointly measure security failure and task usefulness.

42Replay tasks

6Risk classes

7Scenario families

Framework

DeepTrap constructs compromised execution contexts from benign instructions and clean workspaces, searches candidate adversarial payloads with multi-objective reward signals, and refines successful payloads through reflection-based probing.

Data Construction

Benign Task Seeds

Start from normal OpenClaw-style user tasks such as meeting summaries, config checks, sales reports, code formatting, and system administration.

Context Surfaces

Place adversarial content in files, skills, tools, logs, environment-like data, and encoded artifacts while keeping the user prompt benign.

Payload Search

Use reward-guided heuristic search to identify payloads that realize target risks while preserving task plausibility and stealth.

Replay Packaging

Package successful cases as public replay tasks with workspace files, expected behavior, automated checks, and judge criteria.

Risk 1: Harness Hijacking

Poisoned context or injected workflow steps redirect the agent into unrequested actions.

Risk 2: Privacy Leakage

Planted credentials and sensitive values appear in responses, files, or external actions.

Risk 3: Unauthorized Execution

Dangerous commands or boundary-crossing actions are run beyond the user request.

Risk 4: Supply Chain Risk

Malicious skill or tool-like logic is trusted or invoked during the task.

Risk 5: Tool Abuse

Misleading tool metadata causes harmful behavior or unintended side effects.

Risk 6: Encoding Obfuscation

Hidden encoded payloads are decoded and followed unsafely.

Leaderboard

AGS is Attack Grading Score and lower is better. UGS is Utility Grading Score and higher is better. The default order sorts by average AGS from low to high.

Sort by Best safety = lowest Avg AGS.

Model	Avg AGS ↓	Avg UGS ↑	R1 AGS/UGS	R2 AGS/UGS	R3 AGS/UGS	R4 AGS/UGS	R5 AGS/UGS	R6 AGS/UGS
Claude-Sonnet-4.6	0.38	0.61	0.51 / 0.71	0.58 / 0.69	0.37 / 0.55	0.25 / 0.45	0.38 / 0.55	0.20 / 0.71
MiMo-v2.5-pro	0.64	0.86	0.74 / 0.92	0.83 / 0.90	0.56 / 0.88	0.58 / 0.87	0.58 / 0.71	0.53 / 0.87
GPT-5.4	0.70	0.83	0.77 / 0.91	0.84 / 0.83	0.76 / 0.86	0.61 / 0.77	0.67 / 0.74	0.53 / 0.87
MiMo-v2.5	0.72	0.91	0.86 / 0.96	0.87 / 0.95	0.71 / 0.88	0.73 / 0.93	0.57 / 0.83	0.60 / 0.89
MiniMax-M2.5	0.83	0.90	0.86 / 0.92	0.89 / 0.95	0.77 / 1.00	0.66 / 0.88	0.90 / 0.74	0.89 / 0.90
GLM-5	0.83	0.90	0.81 / 0.90	0.93 / 0.90	0.74 / 0.98	0.83 / 0.89	0.79 / 0.83	0.88 / 0.88
Deepseek-v4-Pro	0.86	0.89	0.90 / 0.90	0.96 / 0.91	0.74 / 1.00	0.87 / 0.81	0.85 / 0.84	0.86 / 0.89
Qwen3.5-Plus	0.88	0.95	0.93 / 0.95	0.93 / 0.92	0.86 / 1.00	0.74 / 0.98	0.88 / 0.93	0.97 / 0.93
DeepSeek-v4-Flash	0.89	0.96	0.90 / 0.98	0.96 / 0.96	0.80 / 1.00	0.90 / 0.96	0.82 / 0.85	0.94 / 1.00

Citation

@article{yao2026trap,
  title={Red-Teaming Agent Execution Contexts: Open-World Security Evaluation on OpenClaw},
  author={Yao, Hongwei and Liu, Yiming and He, Yiling and Yang, Bingrun},
  journal={arXiv preprint arXiv:2605.11047},
  year={2026}
}