Module cartridge.issues
Monitor issues across cluster instances.
Cartridge detects the following problems:
Replication:
- critical: «Replication from … to … isn’t running» - when box.info.replication.upstream == nil;
- critical: «Replication from … to … state «stopped»/«orphan»/etc. (…)»;
- warning: «Replication from … to …: high lag» - when upstream.lag > box.cfg.replication_sync_lag;
- warning: «Replication from … to …: long idle» - when upstream.idle > 2 * box.cfg.replication_timeout;
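The replication conditions above can be modelled outside Tarantool as a small classifier. This is a minimal sketch, not Cartridge's implementation: the function name, the dict-shaped `upstream` argument, the first-match ordering, and the set of states treated as healthy (`follow`, `sync`) are all assumptions made for illustration.

```python
# Assumption: states regarded as healthy; the list above does not name them.
HEALTHY_STATES = {"follow", "sync"}


def classify_replication(upstream, replication_sync_lag, replication_timeout):
    """Return (level, reason) for one replication link, or None if it looks fine.

    `upstream` stands in for the upstream table of one entry in
    box.info.replication; pass None when that table is absent.
    """
    if upstream is None:
        return ("critical", "isn't running")
    status = upstream.get("status")
    if status not in HEALTHY_STATES:
        return ("critical", "state " + str(status))
    if upstream.get("lag", 0) > replication_sync_lag:
        return ("warning", "high lag")
    if upstream.get("idle", 0) > 2 * replication_timeout:
        return ("warning", "long idle")
    return None
```

For example, `classify_replication({"status": "follow", "lag": 0, "idle": 2.5}, 10, 1)` reports a long-idle warning, because 2.5 exceeds twice the timeout of 1.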
Failover:
- warning: «Can’t obtain failover coordinator (…)»;
- warning: «There is no active failover coordinator»;
- warning: «Failover is stuck on …: Error fetching appointments (…)»;
- warning: «Failover is stuck on …: Failover fiber is dead» - this is likely a bug;
Switchover:
- warning: «Consistency on … isn’t reached yet»;
Clock:
- warning: «Clock difference between … and … exceed threshold» - when the difference exceeds limits.clock_delta_threshold_warning;
Memory:
- critical: «Running out of memory on …» - when all 3 metrics items_used_ratio, arena_used_ratio, quota_used_ratio from box.slab.info() exceed limits.fragmentation_threshold_critical;
- warning: «Memory is highly fragmented on …» - when items_used_ratio > limits.fragmentation_threshold_warning and both arena_used_ratio, quota_used_ratio exceed critical limit;
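The two memory rules can be sketched as follows. This is an illustrative model only: the function names are hypothetical, the ratios are taken as fractions between 0 and 1, and the helper assumes that box.slab.info() reports each ratio as a percentage string such as "85.50%". The default thresholds are the documented defaults of fragmentation_threshold_critical and fragmentation_threshold_warning.

```python
def parse_ratio(text):
    """Convert a percentage string such as '85.50%' into a fraction (assumed format)."""
    return float(text.strip().rstrip("%")) / 100


def classify_memory(items_used_ratio, arena_used_ratio, quota_used_ratio,
                    critical=0.85, warning=0.6):
    """Return (level, message) for one instance, or None when no rule fires."""
    # Both rules require arena and quota usage above the critical limit.
    arena_and_quota_critical = (arena_used_ratio > critical
                                and quota_used_ratio > critical)
    if arena_and_quota_critical and items_used_ratio > critical:
        return ("critical", "Running out of memory")
    if arena_and_quota_critical and items_used_ratio > warning:
        return ("warning", "Memory is highly fragmented")
    return None
```

With the defaults, ratios of 0.7, 0.9, 0.9 (items, arena, quota) yield the fragmentation warning, while 0.9, 0.9, 0.9 yield the critical issue.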
Configuration:
- warning: «Configuration checksum mismatch on …»;
- warning: «Configuration is prepared and locked on …»;
- warning: «Advertise URI (…) differs from clusterwide config (…)»;
- warning: «Configuring roles is stuck on … and hangs for … so far»;
Vshard:
- various vshard alerts (see vshard docs for details);
- warning: «Group «…» wasn’t bootstrapped: …»;
- warning: Vshard storages in replicaset %s marked as «all writable».
You can enable extra vshard issues by setting the environment variables TARANTOOL_ADD_VSHARD_STORAGE_ALERTS_TO_ISSUES=true / TARANTOOL_ADD_VSHARD_ROUTER_ALERTS_TO_ISSUES=true or by passing the --add-vshard-storage-alerts-to-issues / --add-vshard-router-alerts-to-issues command-line arguments. It’s recommended to enable router alerts in production.
Alien members:
- warning: «Instance … with alien uuid is in the membership» - when two separate clusters share the same cluster cookie;
Expelled instances:
- warning: «Replicaset … has expelled instance … in box.space._cluster» - when an instance was expelled from a replicaset but still remains in box.space._cluster;
Deprecated space format:
- warning: «Instance … has spaces with deprecated format: space1, …»
Raft issues:
- warning: «Raft leader idle is 10.000 on … . Is raft leader alive and connection is healthy?»
Unhealthy replicasets:
- critical: «All instances are unhealthy in replicaset …».
Disk failures:
- critical: «Disk error on instance …».
Disabled instances:
- warning: «Instance had Error and was disabled»
Custom issues (defined by user):
- Custom roles can announce more issues with their own level, topic and message. See custom-role.get_issues.
GraphQL request:
You can get info about cluster issues using the following GraphQL request:
{
cluster {
issues {
level
message
replicaset_uuid
instance_uuid
topic
}
}
}
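One way to send this query from a script is sketched below using only the Python standard library. The helper names are hypothetical, and the sketch assumes that the instance serves its GraphQL API at the /admin/api path and that no authentication is required; adjust both for your deployment.

```python
import json
import urllib.request

ISSUES_QUERY = """
{
  cluster {
    issues {
      level
      message
      replicaset_uuid
      instance_uuid
      topic
    }
  }
}
"""


def build_issues_request(base_url):
    """Build a POST request carrying the issues query."""
    # Assumption: the GraphQL endpoint lives at <base_url>/admin/api.
    body = json.dumps({"query": ISSUES_QUERY}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/admin/api",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_issues(response_body):
    """Pull the list of issues out of a GraphQL JSON response."""
    payload = json.loads(response_body)
    return payload["data"]["cluster"]["issues"]


# Usage (requires a running instance; the address is a placeholder):
# with urllib.request.urlopen(build_issues_request("http://localhost:8081")) as resp:
#     for issue in extract_issues(resp.read()):
#         print(issue["level"], issue["topic"], issue["message"])
```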
Thresholds for issuing warnings.
All settings are local, not clusterwide. They can be changed with the corresponding environment variables (TARANTOOL_*) or command-line arguments. See the cartridge.argparse module for details.
Fields:
- fragmentation_threshold_critical: (number) default: 0.85.
- fragmentation_threshold_full: (number) default: 1.0.
- fragmentation_threshold_warning: (number) default: 0.6.
- clock_delta_threshold_warning: (number) default: 5.
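The interplay between these defaults and per-instance overrides can be sketched as below. This is a model for illustration, not the cartridge.argparse logic: the function name is hypothetical, and the mapping from a field name to its environment variable (TARANTOOL_ followed by the upper-cased field name) is an assumption about the TARANTOOL_* convention.

```python
import os

# Documented defaults from the field list above.
DEFAULT_LIMITS = {
    "fragmentation_threshold_critical": 0.85,
    "fragmentation_threshold_full": 1.0,
    "fragmentation_threshold_warning": 0.6,
    "clock_delta_threshold_warning": 5,
}


def resolve_limits(environ=None):
    """Return the limits, with any environment overrides applied on top of defaults."""
    environ = os.environ if environ is None else environ
    limits = dict(DEFAULT_LIMITS)
    for name in DEFAULT_LIMITS:
        # Assumed naming scheme: TARANTOOL_<FIELD_NAME_IN_UPPER_CASE>.
        raw = environ.get("TARANTOOL_" + name.upper())
        if raw is not None:
            limits[name] = float(raw)
    return limits
```

Because the settings are local, each instance resolves its own limits; two instances in the same cluster may therefore report different issues for the same measurements.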