SC Conference - Activity Details
HPC Fellowship: Process-Level Fault Tolerance for Job Healing in HPC Environments
Presenter:
Chao Wang
(North Carolina State University)
Doctoral Research Showcase Session
Wednesday, 03:30PM - 04:00PM
Room 17A/17B
Abstract:
As the number of nodes in high-performance computing (HPC)
environments keeps increasing, faults are becoming
commonplace. Frequently deployed checkpoint/restart mechanisms
generally require a complete restart of the job.
Yet some node failures can be anticipated by detecting a
deteriorating health status in today's systems, which can
be exploited by proactive fault tolerance (FT).
Our work proposes novel, scalable mechanisms in support of proactive FT
and significant enhancements to reactive FT,
including the following contributions: (1) Provide a transparent
job pause service allowing live nodes to
remain active and roll back to the last checkpoint while failed
nodes are dynamically replaced by spares before resuming from the
last checkpoint; (2) Complement reactive FT with proactive FT through a process-level
live migration mechanism that supports continued execution of an
application during much of the migration; (3) Develop incremental checkpointing
techniques to capture only data changed since the last checkpoint to
reduce the cost of reactive FT.
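
The idea behind contribution (3) can be illustrated with a minimal sketch. It is not the presenter's implementation: it assumes a fixed-size memory image, a hypothetical 4096-byte page granularity, and detects changed pages by comparing SHA-256 digests rather than by any kernel-level dirty-page tracking. The names `page_hashes`, `incremental_checkpoint`, and `restore` are invented for this illustration.

```python
import hashlib

PAGE_SIZE = 4096  # assumed page granularity, chosen only for illustration


def page_hashes(memory, page_size=PAGE_SIZE):
    """Return one SHA-256 digest per page of the memory image."""
    return [
        hashlib.sha256(memory[i:i + page_size]).digest()
        for i in range(0, len(memory), page_size)
    ]


def incremental_checkpoint(memory, prev_hashes, page_size=PAGE_SIZE):
    """Capture only the pages that changed since the previous checkpoint.

    Returns (delta, new_hashes), where delta maps page index -> page bytes.
    """
    new_hashes = page_hashes(memory, page_size)
    delta = {}
    for idx, digest in enumerate(new_hashes):
        if idx >= len(prev_hashes) or prev_hashes[idx] != digest:
            start = idx * page_size
            delta[idx] = bytes(memory[start:start + page_size])
    return delta, new_hashes


def restore(full_image, deltas, page_size=PAGE_SIZE):
    """Rebuild the latest state from a full checkpoint plus ordered deltas.

    Assumes the image size did not change between checkpoints.
    """
    image = bytearray(full_image)
    for delta in deltas:
        for idx, page in delta.items():
            start = idx * page_size
            image[start:start + len(page)] = page
    return bytes(image)


if __name__ == "__main__":
    memory = bytearray(4 * PAGE_SIZE)
    full = bytes(memory)            # initial full checkpoint
    hashes = page_hashes(full)

    memory[PAGE_SIZE + 5] = 7       # touch a single byte in page 1
    delta, hashes = incremental_checkpoint(bytes(memory), hashes)

    print(sorted(delta))            # indices of the pages that were saved
    print(restore(full, [delta]) == bytes(memory))
```

In this sketch, the amount of data written per checkpoint scales with the number of modified pages rather than with the full image size, which is the cost reduction the abstract refers to; restart then needs the last full checkpoint plus every later delta, applied in order.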