Monitor the System Health
The system health is monitored using a collection of JumpScripts, maintained in a private 0-complexity/selfhealing GitHub repository. Check this private repository to get an up-to-date view of all JumpScripts; the overview below is just a snapshot in time.
Also check the JumpScript page in the Grid Portal, where you can filter on monitor.health to see the list of JumpScripts actually available in your environment:
To check the actual system health, go to the Status Overview page in the Grid Portal, which is also reachable by clicking the green, orange or red colored bullet in the top navigation bar:
For more information about the Status Overview page, see the dedicated section here.
In what follows you get an overview of all JumpScripts, organized into the same sections as on the Node Status page.
Depending on the type of node, the following sections are available:
Section | Master Node | CPU Node | Storage Node |
---|---|---|---|
AYS Process | X | X | X |
Databases | X | | |
Disks | X | X | X |
JSAgent | X | X | X |
Network | X | | |
Orphanage | X | X | |
Redis | X | X | X |
System Load | X | X | X |
Temperature | X | X | X |
Workers | X | X | X |
Hardware | X | X | |
Node Status | X | X | |
Deployment Test | X | | |
OVS Services | X | | |
AYS Process
- ays_process_check.py checks if all AYS processes are running. Throws an error condition for each process that is not running
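For illustration, below is a minimal sketch of this kind of process check, using psutil and a hypothetical list of process names (the real JumpScript knows the actual AYS processes to look for):

```python
import psutil

# Hypothetical process names; the real JumpScript knows the actual AYS processes.
EXPECTED_PROCESSES = ["ays_daemon", "ays_scheduler"]

def missing_processes(expected):
    """Return the expected process names that are currently not running."""
    running = {p.info["name"] for p in psutil.process_iter(["name"])}
    return [name for name in expected if name not in running]

for name in missing_processes(EXPECTED_PROCESSES):
    # The real JumpScript raises an error condition per missing process.
    print("ERROR: process not running: %s" % name)
```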
System Load
- cpu_ctxpy_check.py checks the number of CPU context switches per second. If higher than expected, an error condition is thrown (the sampling pattern is sketched after this list)
- cpu_interrupts_check.py checks the number of interrupts per second. If higher than expected, an error condition is thrown
- cpu_mem_core_check.py checks memory and CPU usage/load. If the average per hour is higher than expected, an error condition is thrown
- openfd_check.py checks the number of open file descriptors for each process
- swap_used_check.py checks the amount of swap used by the system
- threads_check.py checks the number of threads, and throws an error condition if higher than expected
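The context switch and interrupt checks boil down to sampling cumulative counters twice and comparing the resulting per-second rate against a threshold. A minimal sketch with psutil and hypothetical thresholds:

```python
import time
import psutil

# Hypothetical thresholds; the real JumpScripts use values tuned per environment.
MAX_CTX_SWITCHES_PER_SEC = 100000
MAX_INTERRUPTS_PER_SEC = 50000

def per_second_rates(interval=1.0):
    """Sample cumulative CPU counters twice and return per-second rates."""
    before = psutil.cpu_stats()
    time.sleep(interval)
    after = psutil.cpu_stats()
    ctx = (after.ctx_switches - before.ctx_switches) / interval
    intr = (after.interrupts - before.interrupts) / interval
    return ctx, intr

ctx, intr = per_second_rates()
if ctx > MAX_CTX_SWITCHES_PER_SEC:
    print("ERROR: %.0f context switches per second exceeds the threshold" % ctx)
if intr > MAX_INTERRUPTS_PER_SEC:
    print("ERROR: %.0f interrupts per second exceeds the threshold" % intr)
```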
Databases
- db_check.py checks the status of the MongoDB and InfluxDB databases on the master. If they are not running, an error condition is thrown.
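One simple way to approximate such a check is to verify that both services accept connections; the sketch below assumes the default MongoDB and InfluxDB ports and is not the actual db_check.py logic:

```python
import socket

# Default MongoDB and InfluxDB ports; adjust if the services are configured differently.
SERVICES = {"mongodb": ("127.0.0.1", 27017), "influxdb": ("127.0.0.1", 8086)}

for name, (host, port) in SERVICES.items():
    try:
        with socket.create_connection((host, port), timeout=2):
            print("OK: %s is reachable on %s:%d" % (name, host, port))
    except OSError:
        # The real JumpScript throws an error condition instead of printing.
        print("ERROR: %s is not reachable on %s:%d" % (name, host, port))
```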
Orphanage
- disk_orphan.py checks for orphan disks on the volume driver nodes and generates a warning if orphan disks exist on the specified volumes. It is scheduled by disk_orphan_schedule.py, which runs on the master, and throws an error condition for each orphan disk found (the general detection pattern is sketched after this list)
- vm_orphan.py checks if libvirt still has VMs that are not known by the system
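Both orphan checks follow the same pattern: build the set of objects that actually exist, subtract the set the platform knows about, and report the difference. A sketch of the VM side, using virsh and a hypothetical platform inventory:

```python
import subprocess

def vms_on_hypervisor():
    """Names of all VMs libvirt knows about on this node (requires virsh)."""
    out = subprocess.run(["virsh", "list", "--all", "--name"],
                         capture_output=True, text=True, check=True).stdout
    return {line.strip() for line in out.splitlines() if line.strip()}

# Hypothetical inventory; the real JumpScript reads the VMs known to the
# platform from the system's own model.
known_to_platform = {"vm-1001", "vm-1002"}

for orphan in sorted(vms_on_hypervisor() - known_to_platform):
    print("WARNING: VM known to libvirt but not to the system: %s" % orphan)
```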
Disks
- disk_usage_check.py checks the status of all physical disks and partitions on all nodes, reporting back the free disk space on the mount points. Throws an error condition for each disk that is almost full (>90% used)
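A minimal sketch of such a disk usage check with psutil, using the 90% threshold mentioned above:

```python
import psutil

THRESHOLD = 90  # percent used above which a disk counts as almost full

for part in psutil.disk_partitions(all=False):
    usage = psutil.disk_usage(part.mountpoint)
    if usage.percent > THRESHOLD:
        # The real JumpScript throws an error condition per affected disk.
        print("ERROR: %s is %.1f%% full (%.1f GiB free)"
              % (part.mountpoint, usage.percent, usage.free / 2**30))
    else:
        print("OK: %s at %.1f%% used" % (part.mountpoint, usage.percent))
```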
Hardware
- fan_check.py checks the fans of a node using IPMItool (see the sketch after this list)
- networkbond_check.py checks whether a network bond (if there is one) has all of its interfaces properly active
- psu_check.py checks the power redundancy of a node using IPMItool
- raid_check.py checks whether all configured RAID devices are still healthy
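A simplified sketch of a fan check via ipmitool; the column layout of the `ipmitool sdr` output is assumed here, and the real fan_check.py interprets the possible sensor states more carefully:

```python
import subprocess

def check_fans():
    """Report fan sensors that do not read 'ok' (requires ipmitool and IPMI access)."""
    out = subprocess.run(["ipmitool", "sdr", "type", "Fan"],
                         capture_output=True, text=True, check=True).stdout
    for line in out.splitlines():
        fields = [field.strip() for field in line.split("|")]
        if len(fields) < 3:
            continue
        name, status = fields[0], fields[2]
        # Simplified: anything other than "ok" is flagged.
        print(("OK" if status.lower() == "ok" else "ERROR")
              + ": fan sensor %s (%s)" % (name, status))

check_fans()
```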
Bandwidth Test
- networkperformance.py tests the bandwidth between the node itself and the storage nodes and volume drivers
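As an illustration, such a bandwidth measurement could be driven by iperf3 from the node under test; the target address below is hypothetical and assumes an iperf3 server is running on the remote side (the actual networkperformance.py may use a different mechanism):

```python
import json
import subprocess

# Hypothetical target node; assumes iperf3 is installed here and an iperf3
# server (`iperf3 -s`) is running on the remote side.
TARGET = "10.0.0.12"

result = subprocess.run(["iperf3", "-c", TARGET, "-t", "5", "-J"],
                        capture_output=True, text=True, check=True)
report = json.loads(result.stdout)
mbps = report["end"]["sum_received"]["bits_per_second"] / 1e6
print("Measured bandwidth to %s: %.0f Mbit/s" % (TARGET, mbps))
```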
OpenvStorage
- ovs_healthcheck.py calls the standard Open vStorage health checks, see: https://github.com/openvstorage/openvstorage-health-check
OVS Services
- ovsstatus.py checks every predefined period (default 60 seconds) whether all OVS processes are still running
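A sketch of a single check round using systemd; the service names below are hypothetical, and the 60-second period is handled by the JumpScript scheduler rather than by the script itself:

```python
import subprocess

# Hypothetical service names; the real JumpScript checks the actual OVS processes.
SERVICES = ["ovs-volumedriver", "ovs-workers"]

for service in SERVICES:
    # systemctl exits 0 when the unit is active.
    active = subprocess.run(["systemctl", "is-active", "--quiet", service]).returncode == 0
    print(("OK" if active else "ERROR") + ": service %s" % service)
```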
Deployment Test
- deployment_test.py checks every predefined period (default 30 minutes) whether the test VM exists and, if it does, tests its write speed. Every 24 hours the test VM is recreated
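The write speed part can be approximated by timing a synced sequential write; the file path and sample size below are arbitrary, and the real deployment_test.py does more (VM existence check, recreation every 24 hours):

```python
import os
import time

# Hypothetical test path and an arbitrary 64 MiB sample size.
TEST_FILE = "/tmp/speedtest.bin"
SIZE_MB = 64

def write_speed_mb_per_s(path, size_mb):
    """Write size_mb mebibytes, fsync, and return the achieved MiB/s."""
    chunk = b"\0" * (1024 * 1024)
    start = time.time()
    with open(path, "wb") as f:
        for _ in range(size_mb):
            f.write(chunk)
        f.flush()
        os.fsync(f.fileno())  # make sure the data actually reached the disk
    elapsed = time.time() - start
    os.remove(path)
    return size_mb / elapsed

print("Write speed: %.1f MiB/s" % write_speed_mb_per_s(TEST_FILE, SIZE_MB))
```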
Network
- publicipswatcher.py checks the status of the available public IPs (a reachability sketch follows this list)
- routeros_check.py checks the status of RouterOS (scheduled by routeros_check_schedule.py)
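A minimal reachability sketch for the public IP check, sending one ICMP echo per address; the addresses shown are placeholders from the TEST-NET range, and the real JumpScript reads the public IPs from the system model:

```python
import subprocess

# Placeholder addresses (TEST-NET range); replace with the real public IPs.
PUBLIC_IPS = ["192.0.2.1", "192.0.2.2"]

for ip in PUBLIC_IPS:
    # One ICMP echo request with a 2-second timeout (Linux ping syntax).
    reachable = subprocess.run(["ping", "-c", "1", "-W", "2", ip],
                               stdout=subprocess.DEVNULL).returncode == 0
    print(("OK" if reachable else "ERROR") + ": public IP %s" % ip)
```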
Redis
- redis_usage_check.py checks Redis server status
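A minimal sketch using the redis Python client against a server on the default port; the real redis_usage_check.py may look at more than reachability and memory usage:

```python
import redis

# Assumes the redis Python package and a Redis server on the default port.
r = redis.Redis(host="127.0.0.1", port=6379)
try:
    r.ping()
    used = r.info("memory")["used_memory_human"]
    print("OK: Redis is up and uses %s of memory" % used)
except redis.ConnectionError:
    print("ERROR: Redis is not reachable")
```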
Stack Status
- nodestatus.py checks the status of each stack (CPU node)
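Conceptually, a stack status is an aggregation of its individual check results into the worst observed severity. A small, purely illustrative sketch with hypothetical stacks and results:

```python
# Severity ordering used to roll individual check results up into one status
# per stack; the stacks and results below are purely illustrative.
SEVERITY = {"OK": 0, "WARNING": 1, "ERROR": 2}

def stack_status(results):
    """Return the worst status reported by the checks of a single stack."""
    return max(results, key=lambda status: SEVERITY[status])

checks = {"stack-01": ["OK", "OK", "WARNING"], "stack-02": ["OK", "ERROR", "OK"]}
for stack, results in checks.items():
    print("%s: %s" % (stack, stack_status(results)))
```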
Temperature
- temp_check.py checks the CPU + disk temperature of the system
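A sketch of a temperature check with psutil (Linux only); disk temperatures typically come from SMART data instead, so this only covers what the kernel sensors expose, and the threshold is hypothetical:

```python
import psutil

MAX_TEMP_C = 75.0  # hypothetical threshold

# sensors_temperatures() is Linux-only and returns readings grouped per chip.
for chip, sensors in psutil.sensors_temperatures().items():
    for sensor in sensors:
        label = sensor.label or chip
        status = "ERROR" if sensor.current > MAX_TEMP_C else "OK"
        print("%s: %s at %.1f C" % (status, label, sensor.current))
```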
Workers
- workerstatus_check.py monitors the workers, checking whether they report back to their agent on a regular basis for new tasks
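Conceptually this is a heartbeat check: compare the time since each worker last reported against a staleness threshold. A sketch with hypothetical heartbeat data:

```python
import time

STALE_AFTER = 300  # seconds without a report before a worker counts as stale

# Hypothetical heartbeat data: worker name -> timestamp of its last report.
# The real JumpScript gets this from the agent the workers report to.
last_seen = {"worker-1": time.time() - 12, "worker-2": time.time() - 900}

now = time.time()
for worker, ts in last_seen.items():
    age = now - ts
    status = "ERROR" if age > STALE_AFTER else "OK"
    print("%s: %s last reported %.0f seconds ago" % (status, worker, age))
```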