This solution addresses the challenge of identifying zombie jobs in HPC environments through an automated and adaptive two-stage approach.

First, the method monitors running jobs and applies anomaly detection to flag abnormal execution patterns. It identifies processes that show typical zombie-job symptoms, such as stalled progress or irregular resource usage, so that zombie jobs are detected consistently and early.
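
As a rough illustration of the detection stage, the sketch below flags a job whose recent monitoring samples show both near-zero CPU use and no growth in output. It is not the method described here: the text does not name the anomaly-detection technique, so a simple threshold rule stands in for it. The `Sample` schema, the `looks_stalled` function, and the `window` and `cpu_floor` defaults are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Sample:
    """One monitoring sample for a running job (hypothetical schema)."""
    cpu_util: float     # fraction of the allocated CPU in use, 0.0 to 1.0
    bytes_written: int  # cumulative output bytes at sample time


def looks_stalled(samples: Sequence[Sample],
                  window: int = 6,
                  cpu_floor: float = 0.02) -> bool:
    """Return True if the last `window` samples suggest stalled progress.

    Assumed rule (a stand-in for real anomaly detection): every recent
    sample is below `cpu_floor` AND the cumulative output did not grow.
    """
    if len(samples) < window:
        return False  # not enough history to judge
    recent = samples[-window:]
    idle = all(s.cpu_util < cpu_floor for s in recent)
    no_output = recent[-1].bytes_written == recent[0].bytes_written
    return idle and no_output
```

Requiring both conditions is a deliberately conservative choice: a job that is idle but still writing output, or busy but temporarily silent, is not flagged.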

Second, the solution incorporates an automated mechanism that evaluates and safely terminates jobs confirmed as zombies. Decision logic based on system policies and historical behavior governs this step, so that remediation is reliable and healthy jobs are not interrupted.
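
The remediation stage could be sketched along the following lines. This is only one possible shape for such decision logic, under stated assumptions: the `Policy` fields, the `Action` values, the `decide` function, and the idea of counting consecutive detections as the "historical behavior" signal are all hypothetical and not taken from the described method.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    KEEP = "keep"            # leave the job alone
    NOTIFY = "notify"        # warn the owner or an operator, do not kill
    TERMINATE = "terminate"  # treat as a confirmed zombie


@dataclass(frozen=True)
class Policy:
    """Hypothetical site policy; field names and defaults are assumptions."""
    min_consecutive_flags: int = 3
    protected_queues: frozenset = frozenset({"interactive"})


def decide(consecutive_flags: int, queue: str,
           policy: Policy = Policy()) -> Action:
    """Map a job's detection history and queue to a remediation action."""
    if consecutive_flags == 0:
        return Action.KEEP
    if queue in policy.protected_queues:
        return Action.NOTIFY  # never auto-terminate protected queues
    if consecutive_flags >= policy.min_consecutive_flags:
        return Action.TERMINATE
    return Action.NOTIFY      # flagged, but not yet confirmed
```

For example, `decide(3, "batch")` yields `Action.TERMINATE`, while `decide(5, "interactive")` yields `Action.NOTIFY`. Requiring several consecutive detections before termination is one way to reduce the risk of interrupting a healthy job on a single false positive.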

Overall, this approach provides a robust strategy for mitigating zombie jobs, improving system performance and resource utilization, and contributing to a smoother HPC user experience.

Status: submitted to the European Patent Office