Author (English name): Min-Chang Hsieh
Thesis title (English): Checkpointing MPI Applications on Multiprocessors Using SMPCkpt
Advisor (English name): Chyi-Ren Dow
Keywords (English): MPI; SMP; Checkpointing; Rollback Recovery; Fault Tolerance; Automatic Fault Detection and Restart
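Many fields of research and application rely on the computing power of computers to help users solve problems. Today, many researchers use parallel computing environments built on the MPI message-passing interface standard to meet their computational needs. In general, checkpointing a parallel program is more complex than checkpointing an ordinary sequential program. Checkpointing chiefly supports rollback and recovery (fault tolerance), record-and-replay debugging, process migration, and job swapping.
This thesis designs, develops, and implements a high-performance checkpointing system with automatic recovery for SMP multiprocessor environments, named SMPCkpt. The main goal is to develop an integrated, MPI-based checkpointing system with automatic recovery on the Sun Enterprise 10000 platform. The system, designed for MPI parallel programs, will include coordinated checkpointing, uncoordinated checkpointing, and existing techniques such as forked checkpointing and compression that improve checkpointing performance; it also provides highly transparent checkpointing and automatic fault recovery. In the experiments, we will use several types of test programs to verify the correctness of SMPCkpt and to analyze the performance of each feature.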
Researchers in many different areas need computational power to solve their specific problems. Today, many of them successfully use implementations of MPI, a standard message-passing interface, in parallel and distributed environments to satisfy their computational requirements. In general, checkpointing a parallel or distributed program is more complex than checkpointing a sequential program. Checkpointing provides the backbone for rollback recovery (fault tolerance), playback debugging, process migration, and job swapping.
This work proposes to investigate techniques for designing, developing, and implementing a checkpointing system for symmetric multiprocessor (SMP) environments, named SMPCkpt. The goal is to develop an integrated checkpointing system for MPI-based applications on the Sun Enterprise 10000. The system will support a range of facilities, including forked checkpointing, compressed checkpointing, and transparent checkpointing, together with automatic fault detection and restart. Experiments will be conducted to obtain performance results for the various checkpointing techniques.
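As a rough illustration of two of the techniques named above, forked checkpointing and compressed checkpointing, the following is a minimal single-process sketch in Python. It is not SMPCkpt's implementation and involves no MPI; the names `forked_checkpoint` and `restore`, the use of `pickle` for the process state, and the file layout are all assumptions made for this example. The idea shown is only that a forked child holds a copy-on-write image of the state as it was at fork time, so it can compress and write that image while the parent keeps computing.

```python
import os
import pickle
import zlib


def forked_checkpoint(state, path):
    """Write a compressed snapshot of `state` from a forked child.

    Hypothetical helper for illustration only (POSIX systems).
    Returns the child's PID so the caller can wait for completion.
    """
    pid = os.fork()
    if pid == 0:
        # Child: sees the parent's memory as it was at fork time.
        status = 1
        try:
            payload = zlib.compress(pickle.dumps(state))
            tmp = path + ".tmp"
            with open(tmp, "wb") as f:
                f.write(payload)
            # Rename so a partially written file never replaces a good one.
            os.replace(tmp, path)
            status = 0
        finally:
            # Leave without running the parent's cleanup handlers.
            os._exit(status)
    # Parent: returns immediately and continues its computation.
    return pid


def restore(path):
    """Load the state saved by forked_checkpoint (hypothetical helper)."""
    with open(path, "rb") as f:
        return pickle.loads(zlib.decompress(f.read()))
```

A caller would invoke `forked_checkpoint(state, path)`, go on modifying `state`, and later call `os.waitpid(pid, 0)` before relying on the file; `restore(path)` then returns the state as it stood when the fork happened, not the later modifications.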
Abstract (Chinese)
Abstract
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
1.1 Motivation
1.2 Overview of the Research
1.3 Thesis Organization
Chapter 2 Related Work
2.1 MPI Standard and Implementations
2.2 Checkpointing Algorithms
2.2.1 Coordinated Checkpointing
2.2.2 Uncoordinated Checkpointing
2.3 Checkpointing Optimizations
2.3.1 Forked Checkpointing
2.3.2 Compressed Checkpointing
2.4 Checkpointing Systems
2.4.1 Libckpt
2.4.2 Ickp
2.4.3 CLIP
2.4.4 UTCP
2.4.5 Chkpt
2.4.6 MPICkpt
2.4.7 Winckp
2.4.8 CoCheck
2.4.9 RENEW
2.4.10 Comparisons of Checkpointing Systems
Chapter 3 System Architecture
3.1 Overview
3.2 Checkpointing Module
3.3 Fault Detection Module
3.4 Rollback and Recovery Module
Chapter 4 A Non-blocking Coordinated Checkpointing Scheme
4.1 Non-blocking Coordinated Checkpointing Algorithm
4.2 Proof of correctness
Chapter 5 Implementation
5.1 Transparency Facility
5.2 Checkpointing Library
5.2.1 Flushing messages from channels
5.2.2 Polling messages in MPICH’s buffer
5.2.3 Saving process states to disk
5.3 Recovery
5.4 SMPCkpt Prototype
Chapter 6 Experimental Results
6.1 NBODY
6.2 FFTW
6.3 NPB Benchmark
Chapter 7 Conclusions
References
APPENDIX A Condensed Chinese Version
Chapter 1 Introduction
Chapter 2 Related Work
Chapter 3 System Architecture
Chapter 4 System Implementation
Chapter 5 Experimental Results
Chapter 6 Conclusions
