
DISTINGUISHED LECTURER SERIES
"An Overview of High Performance Computing and Self Adapting Numerical Software "
Monday, May 16, 2005
1065 Kemper Hall
2:10 - 3:00PM
In this talk we will look at how high performance computing has changed over the last 10 years and look toward the future in terms of trends. A new generation of software libraries and algorithms is needed for the effective and reliable use of (wide area) dynamic, distributed and parallel environments. Some of the software and algorithm challenges have already been encountered, such as management of communication and memory hierarchies through a combination of compile-time and run-time techniques, but the increased scale of computation, depth of memory hierarchies, range of latencies, and increased run-time environment variability will make these problems much harder.
As the number of processors in today's high performance computers continues to grow, the mean time to failure of these computers is becoming significantly shorter than the execution time of many current high performance computing applications. Although today's architectures are usually robust enough to survive node failures without suffering complete system failure, most of today's high performance computing applications cannot survive node failures and therefore, whenever a node fails, have to abort and restart either from the beginning or from a checkpoint on stable storage. Along these lines we will discuss work on the development of fault-tolerant linear algebra algorithms. We will present an approach to building fault-survivable high performance computing applications using diskless checkpointing with FT-MPI. We give a detailed presentation on how to write a fault-survivable application with FT-MPI using diskless checkpointing, and we evaluate the performance overhead of our fault tolerance approach using a preconditioned conjugate gradient equation solver as an example. Experimental results demonstrate that our fault tolerance approach can survive a small number of simultaneous processor failures with low performance overhead and little numerical impact.
Bio
Jack Dongarra is a University Distinguished Professor of Computer Science at the University of Tennessee (UT) and a member of the Distinguished Research Staff at Oak Ridge National Laboratory (ORNL). He is the director of the Innovative Computing Laboratory at UT, which has a staff of 50 people doing research in the area of high performance computing. He is also the director of the Center for Information Technology Research at UT. He has contributed to the design and implementation of the following open source software packages and systems: EISPACK, LINPACK, the BLAS, LAPACK, ScaLAPACK, Netlib, PVM, MPI, NetSolve, Top500, ATLAS, and PAPI. He has published approximately 200 articles, papers, reports and technical memoranda, and he is coauthor of several books. He was awarded the IEEE Sid Fernbach Award in 2004 for his contributions in the application of high performance computers using innovative approaches. He is a Fellow of the AAAS, the ACM, and the IEEE, and a member of the National Academy of Engineering.