This section documents the mechanisms Palyra uses to ensure operational stability, cost control, and automated recovery. This includes the budget governance system for LLM spending, the diagnostics infrastructure for system introspection, and the self-healing loop that monitors and remediates background task failures.

1. Usage Governance and Budgeting

Palyra implements a multi-layered governance system to track and limit LLM usage costs. This system operates by intercepting orchestration requests and evaluating them against defined budget policies before dispatching them to model providers.

1.1. Core Governance Entities

The governance logic is primarily implemented in crates/palyra-daemon/src/usage_governance.rs; its key structures are covered in the subsections below.

1.2. Smart Routing and Cost Estimation

Before a Run begins, the system calculates a PricingEstimate from the estimated token count of the prompt (crates/palyra-daemon/src/usage_governance.rs#65-73). The plan_usage_routing function is the central entry point for this logic; it is called during run stream initialization to determine whether the requested model and parameters align with the effective SmartRoutingRuntimeConfig (crates/palyra-daemon/src/application/run_stream/orchestration.rs#31-32, crates/palyra-daemon/src/usage_governance.rs#50-54).
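The estimation step above can be sketched as follows. This is a minimal illustration, not Palyra's implementation: the PricingEstimate field names, the bytes-per-token heuristic, and the per-token price are all assumptions.

```rust
// Hypothetical sketch of prompt cost estimation. The ~4-bytes-per-token
// heuristic and the price parameter are illustrative placeholders, not
// Palyra's real tokenizer or pricing tables.
#[derive(Debug, PartialEq)]
struct PricingEstimate {
    estimated_prompt_tokens: u64,
    estimated_cost_usd: f64,
}

fn estimate_pricing(prompt: &str, usd_per_1k_tokens: f64) -> PricingEstimate {
    // Rough token count: one token per ~4 bytes of prompt text, rounded up.
    let estimated_prompt_tokens = (prompt.len() as u64 + 3) / 4;
    PricingEstimate {
        estimated_prompt_tokens,
        estimated_cost_usd: estimated_prompt_tokens as f64 / 1000.0 * usd_per_1k_tokens,
    }
}

fn main() {
    let est = estimate_pricing(&"x".repeat(4000), 0.5);
    println!("{} tokens, ${:.4}", est.estimated_prompt_tokens, est.estimated_cost_usd);
}
```

A real implementation would use the provider's tokenizer and per-model pricing rather than a byte heuristic.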

1.3. Usage Governance Flow

The following diagram illustrates how a message route request is governed before reaching the LLM.

Diagram: Governance and Routing Flow

Sources: crates/palyra-daemon/src/usage_governance.rs#201-213, crates/palyra-daemon/src/application/run_stream/orchestration.rs#146-182
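The gating step in that flow, evaluating an estimated cost against a budget policy before dispatch, can be sketched as below. BudgetPolicy, RouteDecision, and the single monthly cap are assumed names and an assumed policy shape, not taken from the source.

```rust
// Hedged sketch of a budget gate: the request is checked against the policy
// before any call reaches a model provider. Names are assumptions.
enum RouteDecision {
    Dispatch,
    Reject(String),
}

struct BudgetPolicy {
    monthly_cap_usd: f64,
}

fn govern_route(spent_usd: f64, estimated_cost_usd: f64, policy: &BudgetPolicy) -> RouteDecision {
    if spent_usd + estimated_cost_usd > policy.monthly_cap_usd {
        RouteDecision::Reject(format!(
            "projected spend {:.2} USD exceeds cap {:.2} USD",
            spent_usd + estimated_cost_usd,
            policy.monthly_cap_usd
        ))
    } else {
        RouteDecision::Dispatch
    }
}

fn main() {
    let policy = BudgetPolicy { monthly_cap_usd: 10.0 };
    match govern_route(9.0, 2.0, &policy) {
        RouteDecision::Dispatch => println!("dispatching to provider"),
        RouteDecision::Reject(reason) => println!("rejected: {reason}"),
    }
}
```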

2. Diagnostics and System Doctor

Palyra provides deep introspection into the daemon’s state through gRPC and HTTP diagnostics endpoints, complemented by a CLI-based “Doctor” for environment repair.

2.1. Console Diagnostics

The /console/v1/diagnostics handler aggregates snapshots from every major subsystem into a single report.
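The aggregation pattern can be sketched as a map from subsystem name to snapshot. The subsystem names and report shape here are illustrative assumptions; the real handler pulls live state from each daemon subsystem.

```rust
use std::collections::BTreeMap;

// Illustrative sketch: collect per-subsystem snapshots into one sorted report.
// Subsystem names are placeholders, not Palyra's actual subsystem list.
fn aggregate_diagnostics(snapshots: Vec<(&str, String)>) -> BTreeMap<String, String> {
    snapshots
        .into_iter()
        .map(|(name, snapshot)| (name.to_string(), snapshot))
        .collect()
}

fn main() {
    let report = aggregate_diagnostics(vec![
        ("background_queue", "3 tasks queued".to_string()),
        ("usage_governance", "under budget".to_string()),
    ]);
    for (subsystem, snapshot) in &report {
        println!("{subsystem}: {snapshot}");
    }
}
```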

2.2. CLI Doctor and Recovery

The palyra doctor command is the primary tool for troubleshooting and repairing the local installation; it is implemented in crates/palyra-cli/src/commands/doctor/recovery.rs. The doctor operates in three modes, Diagnostics, RepairPreview, and RepairApply (crates/palyra-cli/src/commands/doctor/recovery.rs#68-74), and can apply automated fixes.

Diagram: CLI Doctor to Daemon Diagnostics Bridge

Sources: crates/palyra-cli/src/commands/doctor.rs#8-10, crates/palyra-daemon/src/transport/http/handlers/console/diagnostics.rs#6-15
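The three modes suggest a dry-run-style progression. The sketch below shows one plausible shape; the mode names come from the text, but the gating behavior (preview reports what would change, apply performs it) is an assumption.

```rust
// Sketch of the doctor's three modes. Only the mode names are from the
// source; the behavior of each arm is an illustrative assumption.
#[derive(Clone, Copy, PartialEq, Debug)]
enum DoctorMode {
    Diagnostics,
    RepairPreview,
    RepairApply,
}

fn run_doctor(mode: DoctorMode, issues: &[&str]) -> Vec<String> {
    issues
        .iter()
        .map(|issue| match mode {
            DoctorMode::Diagnostics => format!("found: {issue}"),
            DoctorMode::RepairPreview => format!("would repair: {issue}"),
            DoctorMode::RepairApply => format!("repaired: {issue}"),
        })
        .collect()
}

fn main() {
    for line in run_doctor(DoctorMode::RepairPreview, &["stale socket file", "corrupt cache"]) {
        println!("{line}");
    }
}
```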

3. Self-Healing and Background Loops

The self-healing system ensures that long-running background tasks and orchestration runs are monitored for “stalling” or silent failures.

3.1. Work Heartbeats

The daemon tracks active work via the WorkHeartbeat mechanism. Any significant background operation must record heartbeats to avoid being flagged as an incident (crates/palyra-daemon/src/self_healing.rs#1-10).
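A heartbeat record reduces to a last-seen timestamp plus a staleness check. WorkHeartbeat is named in the text, but the unix-millisecond bookkeeping below is an assumed shape for illustration.

```rust
// Minimal sketch of heartbeat bookkeeping; field names are assumptions.
struct WorkHeartbeat {
    work_id: String,
    last_beat_unix_ms: u64,
}

impl WorkHeartbeat {
    fn record(&mut self, now_ms: u64) {
        self.last_beat_unix_ms = now_ms;
    }

    /// True when no heartbeat has been recorded within the threshold window.
    fn is_stalled(&self, now_ms: u64, threshold_ms: u64) -> bool {
        now_ms.saturating_sub(self.last_beat_unix_ms) > threshold_ms
    }
}

fn main() {
    let mut hb = WorkHeartbeat { work_id: "index-rebuild".to_string(), last_beat_unix_ms: 0 };
    hb.record(5_000);
    println!("{} stalled at t=20s? {}", hb.work_id, hb.is_stalled(20_000, 10_000));
}
```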

3.2. Background Queue Supervision

The spawn_background_queue_loop in crates/palyra-daemon/src/background_queue.rs manages the lifecycle of asynchronous tasks. It monitors for:
  1. Expirations: tasks that failed to start before their expires_at_unix_ms (crates/palyra-daemon/src/background_queue.rs#114-116).
  2. Cancellations: propagating parent run cancellations to child background tasks (crates/palyra-daemon/src/background_queue.rs#145-147).
  3. Terminal state finalization: moving tasks to completed, failed, or cancelled based on the outcome of their target runs (crates/palyra-daemon/src/background_queue.rs#157-158).
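The three checks above can be sketched as a single supervision tick over a task. Apart from expires_at_unix_ms, the field and state names are assumptions, and the real loop operates over a queue rather than one task.

```rust
// One supervision pass mirroring the three checks: expiration, cancellation
// propagation, and terminal-state finalization. Names are assumptions.
#[derive(Debug, PartialEq, Clone, Copy)]
enum TaskState {
    Queued,
    Running,
    Completed,
    Failed,
    Cancelled,
    Expired,
}

struct BackgroundTask {
    state: TaskState,
    expires_at_unix_ms: u64,
    parent_cancelled: bool,
}

fn supervise_tick(task: &mut BackgroundTask, now_ms: u64, run_outcome: Option<TaskState>) {
    if task.state == TaskState::Queued && now_ms >= task.expires_at_unix_ms {
        task.state = TaskState::Expired; // 1. expiration: never started in time
    } else if task.parent_cancelled
        && matches!(task.state, TaskState::Queued | TaskState::Running)
    {
        task.state = TaskState::Cancelled; // 2. parent cancellation propagated
    } else if let Some(outcome) = run_outcome {
        task.state = outcome; // 3. finalize from the target run's outcome
    }
}

fn main() {
    let mut task = BackgroundTask {
        state: TaskState::Queued,
        expires_at_unix_ms: 1_000,
        parent_cancelled: false,
    };
    supervise_tick(&mut task, 2_000, None);
    println!("state after tick: {:?}", task.state);
}
```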

3.3. Self-Healing Incident Remediation

When heartbeats have been missed for longer than a defined threshold, the self-healing logic can trigger remediation.

Diagram: Self-Healing Heartbeat Loop

Sources: crates/palyra-daemon/src/background_queue.rs#107-111, crates/palyra-daemon/src/application/run_stream/cancellation.rs#33-40, crates/palyra-daemon/src/self_healing.rs#1-10
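The trigger condition can be sketched as a threshold comparison producing a remediation decision. The Remediation variants below are placeholder names; the actual actions live in self_healing.rs and are not specified here.

```rust
// Hedged sketch of the remediation trigger. The action name is a placeholder
// for whatever self_healing.rs actually does when a stall is detected.
#[derive(Debug, PartialEq)]
enum Remediation {
    None,
    CancelRunAndRecordIncident,
}

fn check_heartbeat(last_beat_ms: u64, now_ms: u64, threshold_ms: u64) -> Remediation {
    if now_ms.saturating_sub(last_beat_ms) > threshold_ms {
        Remediation::CancelRunAndRecordIncident
    } else {
        Remediation::None
    }
}

fn main() {
    // Last heartbeat at t=0, checked at t=61s with a 60s threshold: remediate.
    println!("{:?}", check_heartbeat(0, 61_000, 60_000));
}
```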

4. Web Console Integration

The Web Console provides a dedicated Operations Section for monitoring these systems.

Sources: apps/web/src/console/sections/OperationsSection.tsx#47-158