Safety Boundary and Content Inspection

The palyra-safety crate defines the security primitives used to inspect, classify, and transform content at the boundaries of the Palyra ecosystem. It provides a fail-closed architecture for detecting prompt injection, identifying secret leaks, and enforcing trust-based redaction before data reaches a Model Provider or is exported to a user.

Safety Phases and Data Flow

Content inspection occurs at three primary lifecycle stages, defined by the SafetyPhase enum:

PrePrompt: Content is scanned before being assembled into an LLM prompt. This phase focuses on detecting prompt injection and indirect secret references in untrusted data.
PreExecution: Content (typically tool outputs) is scanned before the agent processes it.
Export: Content is scanned before being returned to the user or saved to logs. This phase emphasizes the redaction of actual credentials and internal system paths.

System Data Flow and Safety Interception

The following diagram illustrates how the palyra-safety primitives intercept data moving between the agent loop, the filesystem, and external LLM providers. Title: Safety Boundary Interception Points Sources: crates/palyra-safety/src/lib.rs#1-9, crates/palyra-daemon/src/application/tool_runtime/workspace_file.rs#1-14, crates/palyra-daemon/src/application/tool_runtime/workspace_patch.rs#1-12

Content Inspection and Detection

The core of the safety system is inspect_text, which classifies content into SafetyFindingCategory buckets and recommends a SafetyAction.

Prompt Injection Detection

Palyra uses a set of PatternRule definitions to identify common injection techniques. These are matched against normalized (whitespace-collapsed, lowercase) text.

Rule Category	Examples	Severity
Instruction Level	”ignore previous instructions”, “you are now [role]“	High / Warning
Exfiltration	”reveal the system prompt”, “print secret”	Critical
Spoofing	`<system>`, `[system]`	High / Warning

Sources: crates/palyra-safety/src/lib.rs#16-188

Secret Leak Scanning

The system identifies two types of sensitive material:

Credential Material: Actual tokens, keys, or passwords.
Credential References: Indirect markers like vault_ref or *_file paths.

The is_sensitive_key function in palyra-common provides the heuristic backbone for identifying these keys (e.g., access_token, api_key, client_secret). Sources: crates/palyra-common/src/redaction.rs#15-39, crates/palyra-safety/src/lib.rs#190-205

Trust Labels and Safety Actions

Every piece of content is associated with a TrustLabel, which dictates how strictly it is inspected:

Trusted: System-generated content or verified local files.
Untrusted: Data from external URLs, user input, or unverified tool outputs.

Based on the inspection findings and the SafetyPhase, the system returns a SafetyAction:

Action	Description
Allow	Content is safe to pass through.
Annotate	Content is allowed but marked with a warning finding.
Redact	Sensitive portions are replaced with `<redacted>`.
Block	Content is rejected entirely (fail-closed).

Sources: crates/palyra-safety/src/lib.rs#32-33, crates/palyra-daemon/src/domain/workspace.rs#7-9

Redaction Utilities

Redaction is implemented across several layers to ensure consistency.

Text Redaction

The redact_text_for_export function applies the Redact action by substituting sensitive values with the REDACTED constant ("<redacted>"). This includes:

Header Redaction: Masking sensitive HTTP headers like Authorization or Set-Cookie crates/palyra-common/src/redaction.rs#131-143.
URL Redaction: Removing credentials from userinfo and sensitive keys from query parameters crates/palyra-common/src/redaction.rs#156-179.
Internal Path Redaction: Masking internal storage paths like .palyra/sessions using REDACTED_INTERNAL_RUNTIME_PATH crates/palyra-common/src/redaction.rs#203-211.

Filesystem Redaction (Workspace Tools)

Workspace tools like palyra.fs.read_file and palyra.fs.search integrate safety scanning directly into their output pipeline. Title: Workspace Tool Redaction Flow If a file read is redacted, the WorkspaceReadFileOutput sets redacted: true and text_authoritative: false to prevent the agent from accidentally overwriting the file with the redacted placeholder crates/palyra-daemon/src/application/tool_runtime/workspace_file.rs#118-149. Sources: crates/palyra-daemon/src/application/tool_runtime/workspace_file.rs#8-14, crates/palyra-common/src/redaction.rs#11-13

Fail-Closed Design in Patching

The palyra.fs.apply_patch tool implements a strict fail-closed policy. Before any mutation occurs:

Dry Run: The patch is planned and validated crates/palyra-daemon/src/application/tool_runtime/workspace_patch.rs#6-8.
Secret File Protection: Patches are rejected if they attempt to insert redaction placeholders (like [redacted]) into known secret-bearing files (e.g., .env, id_rsa) crates/palyra-common/src/workspace_patch.rs#164-167.
Containment: All paths are canonicalized and checked against allowed workspace roots. If a path escapes the root, the entire patch fails crates/palyra-common/src/workspace_patch.rs#150-151.

Sources: crates/palyra-common/src/workspace_patch.rs#1-7, crates/palyra-daemon/src/application/tool_runtime/workspace_patch.rs#10-12

​Safety Phases and Data Flow

​System Data Flow and Safety Interception

​Content Inspection and Detection

​Prompt Injection Detection

​Secret Leak Scanning

​Trust Labels and Safety Actions

​Redaction Utilities

​Text Redaction

​Filesystem Redaction (Workspace Tools)

​Fail-Closed Design in Patching