National Digital Library of Theses and Dissertations in Taiwan (臺灣博碩士論文加值系統)

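Author: Min-Chang Hsieh (謝明璋)
Title: Checkpointing MPI Applications on Multiprocessors Using SMPCkpt (使用SMPCkpt來設置多重處理器環境上MPI程式的檢查點)
Advisor: Chyi-Ren Dow (竇其仁)
Degree: Master's
Institution: Feng Chia University (逢甲大學)
Department: Department of Information Engineering and Computer Science (資訊工程學系)
Discipline: Engineering
Field: Electrical and Computer Engineering
Thesis type: Academic thesis
Year of publication: 2000
Academic year of graduation: 88 (ROC calendar)
Language: Chinese
Number of pages: 84
Keywords (translated from Chinese): MPI, SMP, rollback recovery, fault tolerance, automatic recovery, checkpointing
Keywords (English): MPI, SMP, Checkpointing, Rollback Recovery, Fault Tolerance, Automatic Fault Detection and Restart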
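Many kinds of research and applications rely on the computing power of computers to help users solve problems. Today, many researchers use parallel computing environments built on the MPI message-passing interface standard to meet their computational needs. Overall, setting checkpoints for a parallel program is more complex than for an ordinary sequential program. Checkpointing mainly supports applications such as rollback and recovery (fault tolerance), debugging by recording and then replaying execution behavior, process migration, and job swapping.
This thesis designs, develops, and implements a high-performance checkpointing system with automatic recovery for SMP multiprocessor environments, named SMPCkpt. The main goal is to develop an integrated, MPI-based checkpointing system with automatic recovery on the Sun Enterprise 10000 platform. This checkpointing system for MPI parallel programs will include coordinated checkpointing and uncoordinated checkpointing, and will apply existing techniques such as fork checkpointing and compression to improve checkpointing performance. It also provides highly transparent checkpointing and automatic failure recovery mechanisms. In the experiments, we use different types of test programs to verify the correctness of SMPCkpt and to analyze the performance of each feature.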
Researchers in many different areas need computational power to solve their specific problems. Today, many are successfully using implementations of MPI, a standard message-passing interface, in parallel and distributed environments to satisfy their computational requirements. In general, checkpointing a parallel or distributed program is more complex than checkpointing a sequential program. Checkpointing provides the basis for rollback recovery (fault tolerance), playback debugging, process migration, and job swapping.
This work investigates techniques to design, develop, and implement a checkpointing system for symmetric multiprocessor environments, named SMPCkpt. The goal is to develop an integrated checkpointing system for MPI-based applications on the Sun Enterprise 10000. The system will support a range of facilities, including forked checkpointing, compressed checkpointing, and transparent checkpointing, as well as automatic fault detection and restart. Experiments will be conducted to obtain performance results for the various checkpointing techniques.
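The abstract names forked and compressed checkpointing among the supported facilities without describing how they work. The following is a minimal sketch of the general idea only, not SMPCkpt's implementation: the function names `forked_compressed_checkpoint` and `restore` are hypothetical, and serializing a Python object with `pickle` stands in for saving a real process image. It assumes a POSIX system where `os.fork` is available.

```python
import os
import pickle
import zlib


def forked_compressed_checkpoint(state, path):
    """Write `state` to `path` from a forked child; return the child's pid.

    Hypothetical helper for illustration: `pickle` stands in for the
    process-image dump a real checkpointer would perform.
    """
    pid = os.fork()
    if pid == 0:
        # Child: it owns a copy-on-write snapshot of the parent's memory,
        # so later changes in the parent cannot alter what is saved here.
        status = 1
        try:
            blob = zlib.compress(pickle.dumps(state))  # compressed checkpoint
            tmp = f"{path}.tmp"
            with open(tmp, "wb") as f:
                f.write(blob)
            os.replace(tmp, path)  # rename so no half-written file is visible
            status = 0
        finally:
            os._exit(status)  # leave without running the parent's cleanup
    # Parent: returns immediately and continues computing.
    return pid


def restore(path):
    """Load the state saved by forked_compressed_checkpoint."""
    with open(path, "rb") as f:
        return pickle.loads(zlib.decompress(f.read()))
```

A caller would invoke `forked_compressed_checkpoint(state, path)`, keep computing, and later reap the child with `os.waitpid(pid, 0)` before trusting the file. The fork hides the disk write behind ongoing computation, and the compression trades CPU time for a smaller checkpoint file.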
Abstract (Chinese)
Abstract (English)
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
1.1 Motivation
1.2 Overview of the Research
1.3 Thesis Organization
Chapter 2 Related Work
2.1 MPI Standard and Implementations
2.2 Checkpointing Algorithms
2.2.1 Coordinated Checkpointing
2.2.2 Uncoordinated Checkpointing
2.3 Checkpointing Optimizations
2.3.1 Forked Checkpointing
2.3.2 Compressed Checkpointing
2.4 Checkpointing Systems
2.4.1 Libckpt
2.4.2 Ickp
2.4.3 CLIP
2.4.4 UTCP
2.4.5 Chkpt
2.4.6 MPICkpt
2.4.7 Winckp
2.4.8 CoCheck
2.4.9 RENEW
2.4.10 Comparisons of Checkpointing Systems
Chapter 3 System Architecture
3.1 Overview
3.2 Checkpointing Module
3.3 Fault Detection Module
3.4 Rollback and Recovery Module
Chapter 4 A Non-blocking Coordinated Checkpointing Scheme
4.1 Non-blocking Coordinated Checkpointing Algorithm
4.2 Proof of Correctness
Chapter 5 Implementation
5.1 Transparency Facility
5.2 Checkpointing Library
5.2.1 Flushing Messages from Channels
5.2.2 Polling Messages in MPICH's Buffer
5.2.3 Saving Process States to Disk
5.3 Recovery
5.4 SMPCkpt Prototype
Chapter 6 Experimental Results
6.1 NBODY
6.2 FFTW
6.3 NPB Benchmark
Chapter 7 Conclusions
References
Appendix A Condensed Chinese Version
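Appendix A Chapter 1 Introduction
Appendix A Chapter 2 Related Work
Appendix A Chapter 3 System Architecture
Appendix A Chapter 4 System Implementation
Appendix A Chapter 5 Experimental Results
Appendix A Chapter 6 Conclusions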