Module cartridge.issues

Monitor issues across cluster instances.

Cartridge detects the following problems:

Replication:

critical: “Replication from … to … isn’t running” - when box.info.replication.upstream == nil;
critical: “Replication from … to … state “stopped”/“orphan”/etc. (…)”;
warning: “Replication from … to …: high lag” - when upstream.lag > box.cfg.replication_sync_lag;
warning: “Replication from … to …: long idle” - when upstream.idle > 2 * box.cfg.replication_timeout;
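
The replication checks can be sketched in plain Lua. This is only an illustration, not Cartridge's actual code: classify_upstream is a hypothetical helper, upstream stands for the upstream table of one box.info.replication entry, and cfg stands in for box.cfg. The sketch assumes that a healthy upstream reports the status "follow" and that the checks run in the order listed; both are assumptions.

```lua
-- Hypothetical helper (not part of Cartridge).
-- Returns a level and a short reason, or nil when nothing is wrong.
local function classify_upstream(upstream, cfg)
    if upstream == nil then
        return 'critical', "isn't running"
    end
    -- Assumption: any status other than "follow" counts as a bad state.
    if upstream.status ~= 'follow' then
        return 'critical', ('state "%s"'):format(tostring(upstream.status))
    end
    if upstream.lag > cfg.replication_sync_lag then
        return 'warning', 'high lag'
    end
    if upstream.idle > 2 * cfg.replication_timeout then
        return 'warning', 'long idle'
    end
    return nil
end

-- Stand-in for box.cfg with illustrative values.
local cfg = {replication_sync_lag = 10, replication_timeout = 1}

print(classify_upstream(nil, cfg))
print(classify_upstream({status = 'follow', lag = 0.1, idle = 3}, cfg))
```

With these sample values the second upstream is idle for 3 seconds, which is more than 2 * replication_timeout, so it is classified as a "long idle" warning.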

Failover:

warning: “Can’t obtain failover coordinator (…)”;
warning: “There is no active failover coordinator”;
warning: “Failover is stuck on …: Error fetching appointments (…)”;
warning: “Failover is stuck on …: Failover fiber is dead” - this is likely a bug;

Switchover:

warning: “Consistency on … isn’t reached yet”;

Clock:

warning: “Clock difference between … and … exceed threshold” - when the difference exceeds limits.clock_delta_threshold_warning;

Memory:

critical: “Running out of memory on …” - when all three metrics from box.slab.info() (items_used_ratio, arena_used_ratio, quota_used_ratio) exceed limits.fragmentation_threshold_critical;
warning: “Memory is highly fragmented on …” - when items_used_ratio > limits.fragmentation_threshold_warning and both arena_used_ratio and quota_used_ratio exceed limits.fragmentation_threshold_critical;
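
The two memory rules can be sketched as a small decision function. This is a sketch under stated assumptions, not the module's implementation: classify_memory is a hypothetical name, and the ratios are assumed to be already converted to numbers between 0 and 1 (box.slab.info() itself may report them as percentage strings, which would need parsing first).

```lua
-- Hypothetical helper (not part of Cartridge).
-- `slab` holds numeric ratios in the range 0..1 (an assumption of this sketch).
local function classify_memory(slab, limits)
    local critical = limits.fragmentation_threshold_critical
    local warning = limits.fragmentation_threshold_warning

    -- Both rules require arena and quota usage above the critical limit.
    local arena_and_quota_high = slab.arena_used_ratio > critical
        and slab.quota_used_ratio > critical

    if arena_and_quota_high and slab.items_used_ratio > critical then
        return 'critical' -- "Running out of memory on ..."
    elseif arena_and_quota_high and slab.items_used_ratio > warning then
        return 'warning'  -- "Memory is highly fragmented on ..."
    end
    return nil
end

-- Default thresholds taken from the limits table in this document.
local limits = {
    fragmentation_threshold_critical = 0.85,
    fragmentation_threshold_warning = 0.6,
}

print(classify_memory(
    {items_used_ratio = 0.7, arena_used_ratio = 0.95, quota_used_ratio = 0.9},
    limits
))
```

In the sample call the arena and quota are nearly exhausted while items occupy only 70% of the allocated slabs, which the rules above treat as fragmentation rather than as a true out-of-memory condition.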

Configuration:

warning: “Configuration checksum mismatch on …”;
warning: “Configuration is prepared and locked on …”;
warning: “Advertise URI (…) differs from clusterwide config (…)”;
warning: “Configuring roles is stuck on … and hangs for … so far”;

Vshard:

various vshard alerts (see vshard docs for details);
warning: “Group “…” wasn’t bootstrapped: …”;
warning: “Vshard storages in replicaset … marked as “all writable””;
warning: “Cluster has … doubled buckets. Call require('cartridge.vshard-utils').find_doubled_buckets() for details”;

You can enable extra vshard issues by setting TARANTOOL_ADD_VSHARD_STORAGE_ALERTS_TO_ISSUES=true and/or TARANTOOL_ADD_VSHARD_ROUTER_ALERTS_TO_ISSUES=true, or with the matching --add-vshard-storage-alerts-to-issues and --add-vshard-router-alerts-to-issues command-line arguments. It’s recommended to enable router alerts in production.

Alien members:

warning: “Instance … with alien uuid is in the membership” - when two separate clusters share the same cluster cookie;

Expelled instances:

warning: “Replicaset … has expelled instance … in box.space._cluster” - when instance was expelled from replicaset, but still remains in box.space._cluster;

Deprecated space format:

warning: “Instance … has spaces with deprecated format: space1, …”

Raft issues:

warning: “Raft leader idle is 10.000 on … . Is raft leader alive and connection is healthy?”

Unhealthy replicasets:

critical: “All instances are unhealthy in replicaset …”.

Disk failures:

critical: “Disk error on instance …”.

Disabled instances:

warning: “Instance had Error and was disabled”

Custom issues (defined by user):

Custom roles can announce more issues with their own level, topic and message. See custom-role.get_issues.
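
As an illustration, a role might announce its own issue roughly as follows. The role name, the queue-depth state, and the threshold are all hypothetical. The return shape (an array of tables with level, topic and message) follows the sentence above; consult custom-role.get_issues for the exact contract.

```lua
-- Hypothetical application state and threshold, used only for this sketch.
local queue_depth = 0
local MAX_QUEUE_DEPTH = 1000

-- Callback that reports zero or more issues for this instance.
local function get_issues()
    local issues = {}
    if queue_depth > MAX_QUEUE_DEPTH then
        table.insert(issues, {
            level = 'warning',
            topic = 'my-queue',
            message = ('Queue depth %d exceeds %d'):format(
                queue_depth, MAX_QUEUE_DEPTH
            ),
        })
    end
    return issues
end

-- A role module would normally return this table from its file.
local role = {
    role_name = 'my-queue-role', -- hypothetical name
    get_issues = get_issues,
}

print(#role.get_issues())
```

Returning an empty table when there is nothing to report keeps the callback's result uniform for the caller.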

GraphQL request:

You can get info about cluster issues using the following GraphQL request:

{
    cluster {
        issues {
            level
            message
            replicaset_uuid
            instance_uuid
            topic
        }
    }
}
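
Once the response to this request has been decoded into a Lua table, the issues can be summarized by level. The sketch below assumes the usual GraphQL response envelope with a top-level data field; count_by_level and the sample response are hypothetical, and the messages are shortened examples taken from the list above.

```lua
-- Hypothetical helper: counts issues per level in a decoded response.
local function count_by_level(response)
    local counts = {}
    for _, issue in ipairs(response.data.cluster.issues) do
        counts[issue.level] = (counts[issue.level] or 0) + 1
    end
    return counts
end

-- Sample of what a decoded response might look like (illustrative data only).
local response = {
    data = {
        cluster = {
            issues = {
                {level = 'warning', topic = 'failover',
                 message = 'There is no active failover coordinator'},
                {level = 'critical', topic = 'memory',
                 message = 'Running out of memory on ...'},
                {level = 'warning', topic = 'clock',
                 message = 'Clock difference between ... and ... exceed threshold'},
            },
        },
    },
}

local counts = count_by_level(response)
print(counts.warning, counts.critical)
```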

Tables

limits

Thresholds for issuing warnings. All settings are local, not clusterwide. They can be changed with the corresponding environment variables (TARANTOOL_*) or command-line arguments. See the cartridge.argparse module for details.

Fields:

fragmentation_threshold_critical: (number) default: 0.85.
fragmentation_threshold_full: (number) default: 1.0.
fragmentation_threshold_warning: (number) default: 0.6.
clock_delta_threshold_warning: (number) default: 5.

Local Functions

validate_limits (limits)

Validate limits configuration.

Parameters:

limits: (table)

Returns:

(boolean) true

Or

(nil)

(table) Error description
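
The return convention (true on success, otherwise nil plus an error description) can be illustrated with a standalone sketch. This is not the module's code: validate_limits_sketch is a hypothetical name, the allowed ranges are assumptions, and the error is a plain table with an err field rather than whatever error object the module actually returns.

```lua
-- Assumed ranges, chosen for illustration only.
local RANGES = {
    fragmentation_threshold_critical = {min = 0, max = 1},
    fragmentation_threshold_full = {min = 0, max = 1},
    fragmentation_threshold_warning = {min = 0, max = 1},
    clock_delta_threshold_warning = {min = 0, max = math.huge},
}

-- Hypothetical validator following the "true" or "nil, err" convention.
local function validate_limits_sketch(limits)
    if type(limits) ~= 'table' then
        return nil, {err = 'limits must be a table'}
    end
    for name, value in pairs(limits) do
        local range = RANGES[name]
        if range == nil then
            return nil, {err = ('unknown field "%s"'):format(tostring(name))}
        end
        if type(value) ~= 'number' then
            return nil, {err = ('%s must be a number'):format(name)}
        end
        if value < range.min or value > range.max then
            return nil, {err = ('%s is out of range'):format(name)}
        end
    end
    return true
end

local ok, err = validate_limits_sketch({fragmentation_threshold_warning = 1.5})
print(ok, err and err.err)
```

Callers can then branch on the first return value and read the description from the second one only when the first is nil.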

set_limits (limits)

Update limits configuration.