Prometheus metrics

Overview

The Erst daemon exposes Prometheus metrics at the /metrics endpoint when running in daemon mode. These metrics track the health and performance of remote Stellar nodes (Horizon and Soroban RPC endpoints) used during simulation operations.

Accessing metrics

Start the daemon with:

erst daemon --port 8080 --network testnet

Metrics are available at:

http://localhost:8080/metrics
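For scripted checks, the endpoint can also be read without curl. The following is a minimal Python sketch, assuming the daemon listens on localhost:8080 as in the command above. The helper names fetch_metrics_text and filter_samples are illustrative and are not part of Erst.

```python
import urllib.request


def fetch_metrics_text(url: str = "http://localhost:8080/metrics", timeout: float = 5.0) -> str:
    """Download the raw exposition text from the metrics endpoint."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def filter_samples(metrics_text: str, prefix: str) -> list:
    """Return sample lines whose metric name starts with prefix.

    Comment lines (# HELP, # TYPE) and blank lines are skipped.
    """
    return [
        line
        for line in metrics_text.splitlines()
        if line and not line.startswith("#") and line.startswith(prefix)
    ]
```

Calling filter_samples(fetch_metrics_text(), "remote_node") against a running daemon would list only the remote node series.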

Available metrics

remote_node_last_response_timestamp_seconds

Type: Gauge
Description: Unix timestamp (in seconds) of the last successful simulation response from a remote node.
Labels:

node_address: The RPC URL or identifier of the remote node (e.g., https://soroban-testnet.stellar.org)
network: The Stellar network (testnet, mainnet, futurenet)

Purpose: This metric enables staleness alerting by tracking when each remote node last successfully responded. The timestamp is only updated on successful responses.
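The staleness check is a subtraction of the gauge value from the current time. As a rough illustration in Python (the function names and the example threshold are assumptions chosen for this sketch, not Erst APIs):

```python
import time
from typing import Optional


def seconds_since_last_response(last_response_ts: float, now: Optional[float] = None) -> float:
    """Age of the last successful response: current time minus the gauge value."""
    current = time.time() if now is None else now
    return current - last_response_ts


def is_stale(last_response_ts: float, threshold_seconds: float, now: Optional[float] = None) -> bool:
    """True when the node has not responded successfully within threshold_seconds."""
    return seconds_since_last_response(last_response_ts, now) > threshold_seconds
```

Because the gauge only moves on success, a node that keeps returning errors ages exactly like a node that is not being queried at all.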

Example queries

Alert when a specific node has not responded successfully in 60 seconds:

time() - remote_node_last_response_timestamp_seconds{node_address="https://soroban-testnet.stellar.org"} > 60

Alert when any node hasn’t responded in 5 minutes:

time() - remote_node_last_response_timestamp_seconds > 300

Alert when testnet nodes are stale:

time() - remote_node_last_response_timestamp_seconds{network="testnet"} > 120

remote_node_response_total

Type: Counter
Description: Total number of simulation responses from remote nodes, labeled by status.
Labels:

node_address: The RPC URL or identifier of the remote node
network: The Stellar network (testnet, mainnet, futurenet)
status: Response status (success, error)

Purpose: Track overall node health and error rates over time.

Example queries

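Alert when a node's error ratio exceeds 10% over 5 minutes (both sides sum away the status label; without that, the division matches the error series against itself and always yields 1):

sum by (node_address, network) (rate(remote_node_response_total{status="error"}[5m])) / sum by (node_address, network) (rate(remote_node_response_total[5m])) > 0.1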

Total successful responses per node:

sum by (node_address) (remote_node_response_total{status="success"})

Errors per second by network:

sum by (network) (rate(remote_node_response_total{status="error"}[5m]))
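An error ratio is the growth of the error counter divided by the growth of the counters for every status over the same interval. The Python sketch below works through that arithmetic for two scrapes of one node. It is a simplification of what rate() does (no extrapolation, and a counter drop is treated as a restart from zero), and the function names are illustrative.

```python
from typing import Dict


def counter_increase(previous: float, current: float) -> float:
    """Increase of a counter between two scrapes; a drop is treated as a restart from zero."""
    return current if current < previous else current - previous


def error_ratio(previous: Dict[str, float], current: Dict[str, float]) -> float:
    """Share of responses with status "error" between two scrapes of one node.

    Both arguments map the status label value ("success", "error") to the counter value.
    Returns 0.0 when no responses arrived in the interval.
    """
    increases = {
        status: counter_increase(previous.get(status, 0.0), value)
        for status, value in current.items()
    }
    total = sum(increases.values())
    return increases.get("error", 0.0) / total if total else 0.0
```

With counters moving from 10 successes and 1 error to 18 successes and 2 errors, the interval saw 8 successes and 1 error, so the ratio is 1/9. Note that PromQL yields no usable value (NaN) for an idle interval, whereas this sketch returns 0.0.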

remote_node_response_duration_seconds

Type: Histogram
Description: Duration of simulation requests to remote nodes in seconds.
Labels:

node_address: The RPC URL or identifier of the remote node
network: The Stellar network (testnet, mainnet, futurenet)

Buckets (seconds): [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]
Purpose: Identify performance degradation or latency issues with remote nodes.

Example queries

Alert when p95 latency exceeds 5 seconds:

histogram_quantile(0.95, rate(remote_node_response_duration_seconds_bucket[5m])) > 5

Average response time per node:

rate(remote_node_response_duration_seconds_sum[5m]) / rate(remote_node_response_duration_seconds_count[5m])

p99 latency by network:

histogram_quantile(0.99, sum by (network, le) (rate(remote_node_response_duration_seconds_bucket[5m])))
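Quantiles from a histogram are estimates derived from cumulative bucket counts, so their precision depends on the bucket boundaries. The sketch below shows the usual linear-interpolation approach in Python. It is an approximation written for illustration, not Prometheus source code, and estimate_quantile is a name invented here.

```python
import math
from typing import List, Tuple


def estimate_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative histogram buckets.

    buckets holds (upper_bound, cumulative_count) pairs sorted by upper bound
    and ending with the +Inf bucket, mirroring the _bucket series.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in the range (0, 1]")
    if not buckets or not math.isinf(buckets[-1][0]):
        raise ValueError("buckets must end with the +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # First bucket whose cumulative count reaches the requested rank.
    index = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    upper, count = buckets[index]
    if math.isinf(upper):
        # The quantile lies above the largest finite bound; that bound is the best estimate.
        return buckets[-2][0] if len(buckets) > 1 else math.nan
    lower, previous_count = buckets[index - 1] if index > 0 else (0.0, 0.0)
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - previous_count) / (count - previous_count)
```

One consequence for the bucket list above: any quantile that falls beyond the 10 second bucket is reported as 10, so thresholds above that value can never trigger.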

simulation_execution_total

Type: Counter
Description: Total number of simulation executions, regardless of remote node involvement.
Labels:

status: Execution status (success, error)

Purpose: Track overall system throughput and simulation success rate.

Example queries

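Alert when the simulation error ratio exceeds 5% (both sides are summed so the status label does not restrict the division to the error series alone):

sum(rate(simulation_execution_total{status="error"}[5m])) / sum(rate(simulation_execution_total[5m])) > 0.05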

Total simulations per minute, across all statuses:

sum(rate(simulation_execution_total[1m])) * 60

Prometheus configuration

Add the Erst daemon as a scrape target in your prometheus.yml:

scrape_configs:
  - job_name: 'erst-daemon'
    static_configs:
      - targets: ['localhost:8080']
    metrics_path: '/metrics'
    scrape_interval: 15s

Alerting rules

Example Prometheus alerting rules for remote node health:

groups:
  - name: erst_remote_node_health
    interval: 30s
    rules:
      # Alert when a node has been stale for more than 60 seconds (fires once the condition has held for 1 minute)
      - alert: RemoteNodeStale
        expr: time() - remote_node_last_response_timestamp_seconds > 60
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Remote node {{ $labels.node_address }} is stale"
          description: "Node {{ $labels.node_address }} on {{ $labels.network }} hasn't responded successfully in {{ $value }} seconds"

      # Critical alert when a node has been stale for more than 5 minutes (fires once the condition has held for 2 minutes)
      - alert: RemoteNodeDown
        expr: time() - remote_node_last_response_timestamp_seconds > 300
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Remote node {{ $labels.node_address }} appears down"
          description: "Node {{ $labels.node_address }} on {{ $labels.network }} hasn't responded successfully in {{ $value }} seconds"

      # Alert when error rate is high
      - alert: RemoteNodeHighErrorRate
        expr: |
          sum by (node_address, network) (rate(remote_node_response_total{status="error"}[5m]))
          /
          sum by (node_address, network) (rate(remote_node_response_total[5m])) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate for {{ $labels.node_address }}"
          description: "Node {{ $labels.node_address }} has {{ $value | humanizePercentage }} error rate"

      # Alert when latency is high
      - alert: RemoteNodeHighLatency
        expr: |
          histogram_quantile(0.95, 
            rate(remote_node_response_duration_seconds_bucket[5m])
          ) > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency for {{ $labels.node_address }}"
          description: "Node {{ $labels.node_address }} p95 latency is {{ $value }}s"

      # Alert when overall simulation error rate is high
      - alert: SimulationHighErrorRate
        expr: |
          sum(rate(simulation_execution_total{status="error"}[5m]))
          /
          sum(rate(simulation_execution_total[5m])) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High simulation error rate"
          description: "Simulation error rate is {{ $value | humanizePercentage }}"

Grafana dashboard

Example Grafana dashboard panels:

Node staleness panel

{
  "title": "Time Since Last Successful Response",
  "targets": [
    {
      "expr": "time() - remote_node_last_response_timestamp_seconds",
      "legendFormat": "{{ node_address }}"
    }
  ],
  "yAxis": {
    "label": "Seconds"
  }
}

Error rate panel

{
  "title": "Remote Node Error Rate",
  "targets": [
    {
      "expr": "sum by (node_address, network) (rate(remote_node_response_total{status=\"error\"}[5m])) / sum by (node_address, network) (rate(remote_node_response_total[5m]))",
      "legendFormat": "{{ node_address }}"
    }
  ],
  "yAxis": {
    "label": "Error Rate",
    "format": "percentunit"
  }
}

Latency panel

{
  "title": "Remote Node Response Latency (p95)",
  "targets": [
    {
      "expr": "histogram_quantile(0.95, rate(remote_node_response_duration_seconds_bucket[5m]))",
      "legendFormat": "{{ node_address }}"
    }
  ],
  "yAxis": {
    "label": "Seconds"
  }
}

Testing metrics

Manual verification

Start the daemon

erst daemon --port 8080 --network testnet

Trigger simulations

Execute some simulations via RPC calls or CLI commands.

Check metrics

curl -s http://localhost:8080/metrics | grep remote_node

Verify timestamp updates

# Run multiple times and observe timestamp changes
curl -s http://localhost:8080/metrics | grep remote_node_last_response_timestamp_seconds

Test staleness detection

Stop simulations and observe that the timestamp remains constant while time() - timestamp increases.

Expected metric output

# HELP remote_node_last_response_timestamp_seconds Unix timestamp of the last successful simulation response from a remote node
# TYPE remote_node_last_response_timestamp_seconds gauge
remote_node_last_response_timestamp_seconds{network="testnet",node_address="https://horizon-testnet.stellar.org/"} 1.709123456e+09
remote_node_last_response_timestamp_seconds{network="testnet",node_address="https://soroban-testnet.stellar.org"} 1.709123457e+09

# HELP remote_node_response_duration_seconds Duration of simulation requests to remote nodes in seconds
# TYPE remote_node_response_duration_seconds histogram
remote_node_response_duration_seconds_bucket{network="testnet",node_address="https://soroban-testnet.stellar.org",le="0.005"} 0
remote_node_response_duration_seconds_bucket{network="testnet",node_address="https://soroban-testnet.stellar.org",le="0.1"} 1
remote_node_response_duration_seconds_bucket{network="testnet",node_address="https://soroban-testnet.stellar.org",le="0.5"} 10
remote_node_response_duration_seconds_bucket{network="testnet",node_address="https://soroban-testnet.stellar.org",le="+Inf"} 20
remote_node_response_duration_seconds_sum{network="testnet",node_address="https://soroban-testnet.stellar.org"} 8.5
remote_node_response_duration_seconds_count{network="testnet",node_address="https://soroban-testnet.stellar.org"} 20

# HELP remote_node_response_total Total number of simulation responses from remote nodes by status
# TYPE remote_node_response_total counter
remote_node_response_total{network="testnet",node_address="https://soroban-testnet.stellar.org",status="success"} 18
remote_node_response_total{network="testnet",node_address="https://soroban-testnet.stellar.org",status="error"} 2

# HELP simulation_execution_total Total number of simulation executions by status
# TYPE simulation_execution_total counter
simulation_execution_total{status="success"} 45
simulation_execution_total{status="error"} 3
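Output in this format can be turned into structured samples for ad hoc checks. The Python sketch below is a deliberately simplified parser: it handles plain "name{labels} value" lines like the ones shown above, skips comments, and does not handle trailing sample timestamps or unescape label values. For anything beyond quick verification, a maintained Prometheus client library is a better choice. The names parse_samples, SAMPLE_RE, and LABEL_RE are illustrative.

```python
import re
from typing import Dict, List, Tuple

# Matches: metric_name{label="value",...} value   (the label block is optional)
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
# Matches one label="value" pair, allowing backslash escapes inside the value.
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')

Sample = Tuple[str, Dict[str, str], float]


def parse_samples(metrics_text: str) -> List[Sample]:
    """Parse sample lines into (name, labels, value) tuples."""
    samples: List[Sample] = []
    for raw_line in metrics_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # Skip blank lines and # HELP / # TYPE comments.
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # Not a plain "name{labels} value" line.
        name, label_text, value = match.groups()
        labels = dict(LABEL_RE.findall(label_text or ""))
        samples.append((name, labels, float(value)))
    return samples
```

Each returned tuple carries the metric name, a dictionary of labels, and the value as a float, so the timestamp gauge for a given node_address can be looked up directly.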

Implementation details

The metrics are automatically recorded at the following points:

Remote node responses

Metrics are recorded in internal/rpc/client.go for:

GetTransaction calls to Horizon
GetLedgerEntries calls to Soroban RPC
Other RPC methods that interact with remote nodes

Simulation executions

Metrics are recorded in internal/simulator/runner.go for every simulation run.

Timestamp updates

The remote_node_last_response_timestamp_seconds gauge is only updated on successful responses, ensuring it accurately reflects the last time the node was healthy.
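The update rule can be pictured with a small in-memory model. This Python sketch is purely illustrative: it is not the code in internal/rpc/client.go, and the class name, method name, and parameters are invented for the example. It only mirrors the behavior described above, where every response increments the counter for its status and only successful responses move the timestamp.

```python
import time
from collections import defaultdict
from typing import Dict, Optional, Tuple


class RemoteNodeMetrics:
    """Hypothetical in-memory stand-in for the remote node counter and gauge."""

    def __init__(self) -> None:
        # Keyed by (node_address, network).
        self.last_response_timestamp: Dict[Tuple[str, str], float] = {}
        # Keyed by (node_address, network, status).
        self.response_total: Dict[Tuple[str, str, str], int] = defaultdict(int)

    def record(self, node_address: str, network: str, success: bool,
               now: Optional[float] = None) -> None:
        status = "success" if success else "error"
        # The counter is incremented for every response, whatever its status.
        self.response_total[(node_address, network, status)] += 1
        # The gauge is only moved forward by successful responses.
        if success:
            timestamp = time.time() if now is None else now
            self.last_response_timestamp[(node_address, network)] = timestamp
```

After one success followed by one error, both counters read 1 while the timestamp still points at the success, which is the property that staleness alerting depends on.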

Troubleshooting

Metrics not appearing

Verify the daemon is running: curl http://localhost:8080/health
Check the metrics endpoint: curl http://localhost:8080/metrics
Ensure simulations are being executed (metrics won’t appear until first use)

Timestamp not updating

Verify simulations are succeeding (check logs)
Confirm the node is responding successfully
Check for errors in the daemon logs

High error rates

Check network connectivity to remote nodes
Verify the remote node URLs are correct
Check if the remote nodes are experiencing issues
Review daemon logs for specific error messages

Metrics are exposed in Prometheus exposition format and are compatible with any Prometheus-compatible monitoring system.
