Leonardo: Node Communication Issues

Dear Users,

We regret to inform you that the mitigation measures applied did not work as expected.

During the night, issues were detected with the subnet manager, leading to the reappearance of node communication problems and the subsequent removal of several nodes from the production partitions. Unfortunately, as a consequence, many running jobs failed.

The system is currently unstable, and a significant number of nodes are temporarily unavailable for production.

Our teams are actively investigating and working to resolve the issue. We will inform all users as soon as the system has stabilized and further information is available.

We sincerely apologize for the inconvenience caused and thank you for your understanding and patience.

Best regards,
HPC User Support @ CINECA