▎AI & Multi-Agent

Adversarial Prompting

Inputs designed to coerce a language model or agent into unsafe, unauthorized, or false behavior.

Definition

Adversarial Prompting is inputs designed to coerce a language model or agent into unsafe, unauthorized, or false behavior. In defense applications, it targets staff assistants, document analyzers, and tool-using agents through natural language. The hard part is hidden instructions inside apparently benign mission data, especially when systems are deployed across contested links, coalition boundaries, and mixed human-machine teams. KhanBMS treats it as a key reason KhanBMS separates untrusted content from command policy, tying the concept back to modular command, edge execution, and auditable authority.

Reference attributes

Layer: LLM attack method
Operational value: Targets staff assistants, document analyzers, and tool-using agents through natural language
Primary risk: Hidden instructions inside apparently benign mission data
KhanBMS role: A key reason KhanBMS separates untrusted content from command policy

Related terms

#security#llm#agents