Usage Governance, Diagnostics, and Self-Healing

This section documents the mechanisms Palyra uses to ensure operational stability, cost control, and automated recovery: the budget governance system for LLM spending, the diagnostics infrastructure for system introspection, and the self-healing loop that monitors and remediates background task failures.

1. Usage Governance and Budgeting

Palyra implements a multi-layered governance system to track and limit LLM usage costs. This system operates by intercepting orchestration requests and evaluating them against defined budget policies before dispatching them to model providers.

1.1. Core Governance Entities

The governance logic is primarily implemented in crates/palyra-daemon/src/usage_governance.rs. Key structures include:

RoutingMode: Defines how governance is enforced. Options include Suggest (passive), DryRun (logs outcome but allows execution), and Enforced (blocks execution if limits are exceeded) crates/palyra-daemon/src/usage_governance.rs#24-28.
UsageBudgetEvaluation: Represents the result of checking a request against a specific policy, including consumed vs. projected values and soft/hard limits crates/palyra-daemon/src/usage_governance.rs#88-105.
RoutingDecision: The final output of the governance engine, determining if a run is blocked or requires manual approval crates/palyra-daemon/src/usage_governance.rs#108-126.
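
As a rough illustration of how these three entities could fit together, here is a minimal sketch. Only the type names and the three RoutingMode variants come from the source; every field name, the decide helper, and the rule that a soft-limit breach maps to manual approval are assumptions made for the example, not the definitions in usage_governance.rs.

```rust
/// Enforcement modes named in the text. The derive list is an assumption.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RoutingMode {
    Suggest,
    DryRun,
    Enforced,
}

/// Hypothetical shape: the result of checking one request against one policy.
/// `projected` is assumed to be the total after the request is served.
#[derive(Debug, Clone, PartialEq)]
pub struct UsageBudgetEvaluation {
    pub consumed: f64,
    pub projected: f64,
    pub soft_limit: Option<f64>,
    pub hard_limit: Option<f64>,
}

impl UsageBudgetEvaluation {
    /// True when a soft limit is set and the projected value exceeds it.
    pub fn exceeds_soft(&self) -> bool {
        self.soft_limit.map_or(false, |limit| self.projected > limit)
    }

    /// True when a hard limit is set and the projected value exceeds it.
    pub fn exceeds_hard(&self) -> bool {
        self.hard_limit.map_or(false, |limit| self.projected > limit)
    }
}

/// Hypothetical shape of the engine's final output.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct RoutingDecision {
    pub blocked: bool,
    pub approval_required: bool,
}

/// Assumed policy: only `Enforced` acts on breaches. A hard-limit breach
/// blocks the run; a soft-limit breach alone asks for manual approval.
pub fn decide(mode: RoutingMode, evaluations: &[UsageBudgetEvaluation]) -> RoutingDecision {
    let hard = evaluations.iter().any(|e| e.exceeds_hard());
    let soft = evaluations.iter().any(|e| e.exceeds_soft());
    let enforced = mode == RoutingMode::Enforced;
    RoutingDecision {
        blocked: enforced && hard,
        approval_required: enforced && soft && !hard,
    }
}
```

Under these assumptions, the same over-budget request is blocked in Enforced mode but passes through in Suggest or DryRun mode, where the evaluation would only be reported or logged.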

1.2. Smart Routing and Cost Estimation

Before a Run begins, the system calculates a PricingEstimate from the estimated token count of the prompt crates/palyra-daemon/src/usage_governance.rs#65-73. The plan_usage_routing function is the central entry point for this logic: it is called during run stream initialization to determine whether the requested model and parameters align with the effective SmartRoutingRuntimeConfig crates/palyra-daemon/src/application/run_stream/orchestration.rs#31-32 crates/palyra-daemon/src/usage_governance.rs#50-54.
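
The estimation step can be pictured with a small sketch. PricingEstimate is named in the source, but its fields here are hypothetical, and the four-characters-per-token heuristic and the per-million-token price parameter are illustrative assumptions rather than the daemon's actual tokenizer or pricing table.

```rust
/// Hypothetical fields; the real PricingEstimate may carry more detail.
#[derive(Debug, Clone, PartialEq)]
pub struct PricingEstimate {
    pub estimated_input_tokens: u64,
    pub estimated_cost_usd: f64,
}

/// Rough heuristic (assumption): about four characters per token,
/// rounded up. A real implementation would likely use a tokenizer.
pub fn estimate_tokens(prompt: &str) -> u64 {
    let chars = prompt.chars().count() as u64;
    (chars + 3) / 4
}

/// Converts the estimated token count into a cost using an assumed
/// price expressed in USD per one million input tokens.
pub fn estimate_price(prompt: &str, usd_per_million_input_tokens: f64) -> PricingEstimate {
    let tokens = estimate_tokens(prompt);
    PricingEstimate {
        estimated_input_tokens: tokens,
        estimated_cost_usd: tokens as f64 * usd_per_million_input_tokens / 1_000_000.0,
    }
}
```

An estimate of this kind only needs to be accurate enough to compare against budget limits before dispatch; actual usage reported by the provider would replace it afterwards.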

1.3. Usage Governance Flow

The following diagram illustrates how a message route request is governed before reaching the LLM.

Diagram: Governance and Routing Flow

Sources: crates/palyra-daemon/src/usage_governance.rs#201-213, crates/palyra-daemon/src/application/run_stream/orchestration.rs#146-182

2. Diagnostics and System Doctor

Palyra provides deep introspection into the daemon’s state through gRPC and HTTP diagnostics endpoints, complemented by a CLI-based “Doctor” for environment repair.

2.1. Console Diagnostics

The /console/v1/diagnostics handler aggregates snapshots from every major subsystem. It collects:

Model Provider Status: Connectivity and circuit breaker state crates/palyra-daemon/src/transport/http/handlers/console/diagnostics.rs#11-15.
Auth Profiles: Status of identity and credential providers crates/palyra-daemon/src/transport/http/handlers/console/diagnostics.rs#16-20.
Memory Maintenance: Statistics on vector DB usage and TTL vacuuming schedules crates/palyra-daemon/src/transport/http/handlers/console/diagnostics.rs#54-56.
Observability: Aggregated recent failures across connectors and browsers crates/palyra-daemon/src/transport/http/handlers/console/diagnostics.rs#62-63.
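
To make the aggregation concrete, the sketch below models one combined snapshot and derives a list of degraded subsystems from it. All type names, field names, and the degradation rules are hypothetical; the handler in diagnostics.rs may structure its response quite differently.

```rust
/// Hypothetical circuit breaker states for a model provider.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CircuitState {
    Closed,
    Open,
}

/// Hypothetical aggregate of the four areas listed above.
#[derive(Debug, Clone)]
pub struct DiagnosticsSnapshot {
    pub model_provider_reachable: bool,
    pub model_provider_circuit: CircuitState,
    pub auth_profiles_healthy: bool,
    pub memory_entries: u64,
    pub next_vacuum_unix_ms: Option<u64>,
    pub recent_failures: Vec<String>,
}

impl DiagnosticsSnapshot {
    /// Names the subsystems that look unhealthy under assumed rules:
    /// an unreachable provider or open circuit, unhealthy auth profiles,
    /// or any recorded recent failure.
    pub fn degraded_subsystems(&self) -> Vec<&'static str> {
        let mut degraded = Vec::new();
        if !self.model_provider_reachable || self.model_provider_circuit == CircuitState::Open {
            degraded.push("model_provider");
        }
        if !self.auth_profiles_healthy {
            degraded.push("auth_profiles");
        }
        if !self.recent_failures.is_empty() {
            degraded.push("observability");
        }
        degraded
    }
}
```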

2.2. CLI Doctor and Recovery

The palyra doctor command is the primary tool for troubleshooting and repairing the local installation. It is implemented in crates/palyra-cli/src/commands/doctor/recovery.rs. The doctor operates in several modes: Diagnostics, RepairPreview, and RepairApply crates/palyra-cli/src/commands/doctor/recovery.rs#68-74. It can perform automated fixes such as:

Reinitializing missing configuration files crates/palyra-cli/src/commands/doctor/recovery.rs#179-181.
Normalizing corrupted auth registries crates/palyra-cli/src/commands/doctor/recovery.rs#195-198.
Backfilling missing access registry entries crates/palyra-cli/src/commands/doctor/recovery.rs#234-236.
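
The difference between the three modes can be sketched as follows. The mode names come from the source; the RepairStep type, the run_doctor function, and the idea of passing findings as strings are hypothetical simplifications, not the CLI's actual API.

```rust
/// Modes named in the text. The derive list is an assumption.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DoctorMode {
    Diagnostics,
    RepairPreview,
    RepairApply,
}

/// Hypothetical record of one planned or executed repair.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct RepairStep {
    pub description: String,
    pub applied: bool,
}

/// Assumed behavior: Diagnostics plans no repairs, RepairPreview lists
/// the repairs without side effects, and RepairApply runs `apply` for
/// each finding and marks the step as applied.
pub fn run_doctor(
    mode: DoctorMode,
    findings: &[&str],
    mut apply: impl FnMut(&str),
) -> Vec<RepairStep> {
    if mode == DoctorMode::Diagnostics {
        return Vec::new();
    }
    let applying = mode == DoctorMode::RepairApply;
    findings
        .iter()
        .map(|finding| {
            if applying {
                apply(*finding);
            }
            RepairStep {
                description: (*finding).to_string(),
                applied: applying,
            }
        })
        .collect()
}
```

Separating preview from apply in this way lets an operator review the planned changes before any file is modified.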

Diagram: CLI Doctor to Daemon Diagnostics Bridge

Sources: crates/palyra-cli/src/commands/doctor.rs#8-10, crates/palyra-daemon/src/transport/http/handlers/console/diagnostics.rs#6-15

3. Self-Healing and Background Loops

The self-healing system ensures that long-running background tasks and orchestration runs are monitored for “stalling” or silent failures.

3.1. Work Heartbeats

The daemon tracks active work via the WorkHeartbeat mechanism. Any significant background operation must record heartbeats to avoid being flagged as an incident crates/palyra-daemon/src/self_healing.rs#1-10.

WorkHeartbeatKind: Categorizes the work (e.g., Run, BackgroundTask, CronJob) crates/palyra-daemon/src/self_healing.rs#12-18.
Heartbeat Recording: Components call record_self_healing_heartbeat periodically during execution crates/palyra-daemon/src/background_queue.rs#107-111.
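
A heartbeat store of this kind might look like the sketch below. WorkHeartbeatKind and its three example variants appear in the source; the HeartbeatRegistry type, its methods, and the use of a string identifier are assumptions, and the real record_self_healing_heartbeat function may have a different signature.

```rust
use std::collections::HashMap;

/// Variants given as examples in the text. The derive list is an assumption.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum WorkHeartbeatKind {
    Run,
    BackgroundTask,
    CronJob,
}

/// Hypothetical in-memory store of the latest heartbeat per unit of work.
#[derive(Debug, Default)]
pub struct HeartbeatRegistry {
    last_seen_unix_ms: HashMap<(WorkHeartbeatKind, String), u64>,
}

impl HeartbeatRegistry {
    /// Records (or overwrites) the latest heartbeat timestamp.
    pub fn record(&mut self, kind: WorkHeartbeatKind, id: &str, now_unix_ms: u64) {
        self.last_seen_unix_ms.insert((kind, id.to_string()), now_unix_ms);
    }

    /// Returns the latest heartbeat timestamp, if the work is tracked.
    pub fn last_seen(&self, kind: WorkHeartbeatKind, id: &str) -> Option<u64> {
        self.last_seen_unix_ms.get(&(kind, id.to_string())).copied()
    }

    /// Stops tracking the work, for example once it reaches a terminal state.
    pub fn clear(&mut self, kind: WorkHeartbeatKind, id: &str) {
        self.last_seen_unix_ms.remove(&(kind, id.to_string()));
    }
}
```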

3.2. Background Queue Supervision

The spawn_background_queue_loop in crates/palyra-daemon/src/background_queue.rs manages the lifecycle of asynchronous tasks. It monitors for:

Expirations: Tasks that failed to start before their expires_at_unix_ms deadline crates/palyra-daemon/src/background_queue.rs#114-116.
Cancellations: Propagating parent run cancellations to child background tasks crates/palyra-daemon/src/background_queue.rs#145-147.
Terminal State Finalization: Moving tasks to completed, failed, or cancelled based on the outcome of their target runs crates/palyra-daemon/src/background_queue.rs#157-158.
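
The three checks above can be sketched as a single state-transition function. The expires_at_unix_ms field and the completed, failed, and cancelled outcomes come from the source; the type names, the separate Expired state, and the order in which the checks run are assumptions made for illustration.

```rust
/// Hypothetical task states. `Expired` as a distinct state is an assumption.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TaskState {
    Queued,
    Running,
    Completed,
    Failed,
    Cancelled,
    Expired,
}

/// Hypothetical outcome of the run a task targets.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RunOutcome {
    InProgress,
    Succeeded,
    Failed,
    Cancelled,
}

#[derive(Debug, Clone)]
pub struct BackgroundTask {
    pub state: TaskState,
    pub expires_at_unix_ms: Option<u64>,
    pub parent_run_cancelled: bool,
    pub target_run: Option<RunOutcome>,
}

/// Computes the next state for one supervision pass.
pub fn supervise(task: &BackgroundTask, now_unix_ms: u64) -> TaskState {
    match task.state {
        // Terminal states are never revisited.
        TaskState::Completed | TaskState::Failed | TaskState::Cancelled | TaskState::Expired => {
            task.state
        }
        // Assumed ordering: a cancelled parent run wins over other checks.
        _ if task.parent_run_cancelled => TaskState::Cancelled,
        // A queued task expires once its deadline has passed.
        TaskState::Queued => {
            let expired = task
                .expires_at_unix_ms
                .map_or(false, |deadline| now_unix_ms >= deadline);
            if expired { TaskState::Expired } else { TaskState::Queued }
        }
        // A running task is finalized from the outcome of its target run.
        TaskState::Running => match task.target_run {
            Some(RunOutcome::Succeeded) => TaskState::Completed,
            Some(RunOutcome::Failed) => TaskState::Failed,
            Some(RunOutcome::Cancelled) => TaskState::Cancelled,
            Some(RunOutcome::InProgress) | None => TaskState::Running,
        },
    }
}
```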

3.3. Self-Healing Incident Remediation

When no heartbeat has been recorded for longer than a defined threshold, the self-healing logic can trigger remediation:

Run Cancellation: If an orchestration run hangs, the self-healing loop transitions it to Cancelled and clears its heartbeat crates/palyra-daemon/src/application/run_stream/cancellation.rs#16-33.
Task Re-dispatch: Background tasks that stall in a running state without heartbeats may be marked for retry or failure crates/palyra-daemon/src/background_queue.rs#185-190.
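
A minimal sketch of this decision, assuming a fixed staleness threshold and a retry budget per task, is shown below. None of these names or parameters come from the source; in particular, the rule that a task is retried until max_attempts is reached and then failed is only one plausible reading of "marked for retry or failure".

```rust
/// Hypothetical subset of tracked work kinds relevant to remediation.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum WorkKind {
    Run,
    BackgroundTask,
}

/// Hypothetical remediation actions.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Remediation {
    NoAction,
    CancelRun,
    RetryTask,
    FailTask,
}

#[derive(Debug, Clone)]
pub struct TrackedWork {
    pub kind: WorkKind,
    pub last_heartbeat_unix_ms: u64,
    pub attempts: u32,
}

/// Chooses an action for one unit of work. Work that has sent a heartbeat
/// within `stale_after_ms` is left alone; a stalled run is cancelled; a
/// stalled task is retried while attempts remain and failed otherwise.
pub fn remediate(
    work: &TrackedWork,
    now_unix_ms: u64,
    stale_after_ms: u64,
    max_attempts: u32,
) -> Remediation {
    let silent_for = now_unix_ms.saturating_sub(work.last_heartbeat_unix_ms);
    if silent_for <= stale_after_ms {
        return Remediation::NoAction;
    }
    match work.kind {
        WorkKind::Run => Remediation::CancelRun,
        WorkKind::BackgroundTask if work.attempts < max_attempts => Remediation::RetryTask,
        WorkKind::BackgroundTask => Remediation::FailTask,
    }
}
```

Using saturating_sub keeps the check safe if a heartbeat timestamp is slightly ahead of the supervisor's clock.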

Diagram: Self-Healing Heartbeat Loop

Sources: crates/palyra-daemon/src/background_queue.rs#107-111, crates/palyra-daemon/src/application/run_stream/cancellation.rs#33-40, crates/palyra-daemon/src/self_healing.rs#1-10

4. Web Console Integration

The Web Console provides a dedicated Operations Section for monitoring these systems.

Diagnostics View: Displays the model provider state, auth profile status, and browser service health apps/web/src/console/sections/OperationsSection.tsx#112-129.
Self-Healing Dashboard: Visualizes active incidents, recent remediation attempts, and heartbeat status apps/web/src/console/sections/OperationsSection.tsx#143-147.
Usage Insights: Summarizes total spending, model mix, and active usage alerts apps/web/src/console/sections/OperationsSection.tsx#137-141.

Sources: apps/web/src/console/sections/OperationsSection.tsx#47-158
