System Designs for Visualizing Large-Scale Scientific Data



Organizer:
Kwan-Liu Ma, ICASE


Presenters:
Michael B. Cox, MRJ at NASA Ames Research Center
Christopher R. Johnson, University of Utah
Kwan-Liu Ma, ICASE
William J. Schroeder, Kitware Visualization Solutions


Leading-edge scientific and engineering computations and experiments generate data of unprecedented size and complexity, presenting new challenges to the scientists and engineers who must analyze and visualize those data. Various efforts in academia, industry, national laboratories, and government have sought to meet the pressing need for new methods of handling massive datasets. A common conclusion is the need for an integrated, system-level approach. In this course, we highlight research efforts on visualization software system designs that address the large-data problem. The following four topics will be covered:

  1. Large data management for interactive visualization design
  2. Adapting data-flow systems to large datasets
  3. Parallel visualization systems
  4. Visual computing and interactive steering

The first lecture lays the groundwork for the rest of the course. Michael Cox gives an overview of the problems extremely large datasets pose for scientific visualization and reviews current solutions and research directions, with an emphasis on data management for interactive visualization design. Michael begins with the distinction between "big data collections" and "big data objects". Big data collections are extremely large collections of scientific data. Any single dataset in such a collection may be small, perhaps 100 megabytes, but in aggregate the collection may comprise terabytes or petabytes. Big data objects are just that -- extremely large individual data objects, such as the vector or scalar fields output by computational physics simulations. His lecture is concerned with approaches to visualizing "big data objects."

Michael continues with a discussion of application characteristics that determine which approaches are likely to be productive, under what circumstances, and which are not. These characteristics include:

  1. Query vs. browse.
  2. Direct rendering vs. algorithmic data traversal.
  3. Static vs. dynamic data.
  4. Data dimensionality and organization.

He then explores current techniques for managing extremely large datasets. These techniques are examined in the context of the application characteristics above, and include:
  1. Memory hierarchy and system solutions.
  2. Indexing.
  3. Compression.
  4. Multiresolution.
  5. Data mining and feature extraction.

For each of these approaches, examples and previous successes are discussed, along with some of their limitations. Where techniques are unavailable, or not yet proven, opportunities and promising research directions are offered. A small sketch of the multiresolution idea follows.
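
To make one of these techniques concrete, the sketch below picks the finest power-of-two subsampling of a large raw volume that still fits a fixed in-core memory budget, then gathers the subsampled voxels from disk. This is a minimal illustration only: the file name, volume dimensions, and budget are hypothetical, and a production system would read bricks rather than single voxels.

  // Minimal multiresolution sketch: choose the finest power-of-two
  // subsampling of a large raw volume that fits an in-core budget.
  // File name, dimensions, and budget are hypothetical stand-ins.
  #include <cstddef>
  #include <cstdio>
  #include <vector>

  int main() {
      const size_t dim = 1024;          // assumed 1024^3 byte-valued volume
      const size_t budget = 64u << 20;  // assumed 64 MB in-core budget

      size_t stride = 1;                // coarsen until the level fits
      while ((dim / stride) * (dim / stride) * (dim / stride) > budget)
          stride *= 2;
      const size_t n = dim / stride;
      std::printf("stride %zu -> %zu^3 in-core volume\n", stride, n);

      std::vector<unsigned char> incore(n * n * n);
      std::FILE* f = std::fopen("volume_1024.raw", "rb");  // hypothetical file
      if (!f) { std::perror("volume_1024.raw"); return 1; }

      // Gather every stride-th voxel; a real system would fetch whole
      // bricks to amortize seek cost across many voxels.
      for (size_t z = 0; z < n; ++z)
          for (size_t y = 0; y < n; ++y)
              for (size_t x = 0; x < n; ++x) {
                  long off = (long)(((z * stride) * dim + y * stride) * dim
                                    + x * stride);
                  std::fseek(f, off, SEEK_SET);
                  incore[(z * n + y) * n + x] = (unsigned char)std::fgetc(f);
              }
      std::fclose(f);
      return 0;
  }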

Conventional data-flow-based systems are widely used in the visualization community. Their success stems partly from how naturally the data-flow approach maps onto the visualization process, which transforms data through several steps into sensory representations. Unfortunately, typical implementations of these systems pass entire datasets through the pipeline. This approach fails when data sizes grow large, since physical and virtual memory are exhausted.

In the second lecture, William Schroeder introduces an alternative implementation of a data-flow visualization system: instead of processing entire datasets, a streaming approach processes a dataset piece by piece. William describes the issues involved in implementing such an approach, including handling boundaries, extensions for multithreading, mapping input to output, the effect of memory limitations, and generating processing-order-invariant results. He also describes a successful streaming implementation in the freely available Visualization Toolkit (vtk) system, presents results, and performs a live demonstration.
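
As a flavor of the piece-wise approach (an illustration, not Schroeder's actual demonstration), the sketch below asks a VTK poly-data mapper to render its input in several sub-pieces, so the isosurface of a large volume is extracted and drawn a piece at a time rather than all at once. The file name, piece count, and isovalue are assumed, and the code uses the modern VTK API with its sub-piece streaming support (vtkPolyDataMapper::SetNumberOfSubPieces).

  // Streaming sketch with VTK: render an isosurface in sub-pieces so
  // the full dataset never has to be resident at once. The input file,
  // piece count, and isovalue below are hypothetical.
  #include <vtkActor.h>
  #include <vtkContourFilter.h>
  #include <vtkPolyDataMapper.h>
  #include <vtkRenderWindow.h>
  #include <vtkRenderer.h>
  #include <vtkSmartPointer.h>
  #include <vtkXMLImageDataReader.h>

  int main() {
      auto reader = vtkSmartPointer<vtkXMLImageDataReader>::New();
      reader->SetFileName("large_volume.vti");   // hypothetical dataset

      auto contour = vtkSmartPointer<vtkContourFilter>::New();
      contour->SetInputConnection(reader->GetOutputPort());
      contour->SetValue(0, 128.0);               // assumed isovalue

      // The mapper requests and renders 16 sub-pieces in turn; each
      // piece streams through the reader and contour filter separately.
      auto mapper = vtkSmartPointer<vtkPolyDataMapper>::New();
      mapper->SetInputConnection(contour->GetOutputPort());
      mapper->SetNumberOfSubPieces(16);

      auto actor = vtkSmartPointer<vtkActor>::New();
      actor->SetMapper(mapper);

      auto renderer = vtkSmartPointer<vtkRenderer>::New();
      renderer->AddActor(actor);

      auto window = vtkSmartPointer<vtkRenderWindow>::New();
      window->AddRenderer(renderer);
      window->Render();
      return 0;
  }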

Increasingly, scientific computations with demanding memory and processing requirements are performed on massively parallel supercomputers such as the Cray T3E, IBM SP2, and SGI Origin 2000; the DOE's ASCI program is one example. To support applications running on these MPP supercomputers, visualization tools suited to parallel architectures are being developed to make high-fidelity visualization of the application datasets possible. Kwan-Liu Ma illustrates the design issues for parallel visualization systems used either for postprocessing of the data or for runtime monitoring of the simulation.

Existing parallel rendering algorithms for distributed-memory architectures scale well only to a few hundred processors; beyond that, communication overheads tend to inhibit further performance gains. It is also crucial that visualization calculations not compete with the simulation calculations for parallel computing resources. Kwan-Liu explores new algorithms and new ways of structuring a parallel visualization system to achieve maximum performance with a limited number of processors and limited storage; a simplified compositing sketch follows.
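
As a baseline illustration of the compositing step in sort-last parallel rendering (a deliberate simplification, not the algorithms covered in the lecture), the sketch below has each MPI process contribute a full-resolution color-plus-depth image of its data partition, and a user-defined reduction keeps the nearest fragment per pixel. The image size and buffers are hypothetical; a scalable system would use a scheme such as binary-swap compositing rather than a single reduction to one node.

  // Hedged sketch: sort-last depth compositing with MPI. Each process
  // is assumed to have rendered its own data partition into a
  // full-resolution color+depth buffer.
  #include <mpi.h>
  #include <cfloat>
  #include <cstdio>
  #include <vector>

  struct Pixel { float z; float r, g, b; };  // depth + color per pixel

  // User-defined reduction: per pixel, keep the fragment nearest the eye.
  static void depthComposite(void* inv, void* inoutv, int* len,
                             MPI_Datatype*) {
      const Pixel* in = static_cast<const Pixel*>(inv);
      Pixel* inout = static_cast<Pixel*>(inoutv);
      for (int i = 0; i < *len; ++i)
          if (in[i].z < inout[i].z) inout[i] = in[i];
  }

  int main(int argc, char** argv) {
      MPI_Init(&argc, &argv);
      int rank; MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      const int W = 512, H = 512;                // assumed image size
      std::vector<Pixel> local(W * H), composited(W * H);
      for (auto& p : local) { p.z = FLT_MAX; p.r = p.g = p.b = 0.0f; }
      // ... local rendering of this process's partition goes here ...

      // Describe Pixel as four contiguous floats for MPI.
      MPI_Datatype pixelType;
      MPI_Type_contiguous(4, MPI_FLOAT, &pixelType);
      MPI_Type_commit(&pixelType);

      MPI_Op compositeOp;  // nearest-z selection is commutative
      MPI_Op_create(depthComposite, 1, &compositeOp);

      // Simplest correct baseline: reduce all images onto rank 0.
      MPI_Reduce(local.data(), composited.data(), W * H, pixelType,
                 compositeOp, 0, MPI_COMM_WORLD);
      if (rank == 0) std::printf("composited %dx%d image\n", W, H);

      MPI_Op_free(&compositeOp);
      MPI_Type_free(&pixelType);
      MPI_Finalize();
      return 0;
  }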

By providing immediate visual feedback from large-scale simulations, the engineer or designer can more easily determine whether a computation is headed toward a useful result. If not, the user can terminate the computation or adjust simulation parameters on the fly. Incorporating visualization directly into applications reduces the need for time-consuming post-processing and makes unexpected behavior or anomalies in the data easier to spot. The net result is faster design cycles and higher-quality solutions.

While Kwan-Liu's lecture covers runtime visualization, its focus is how visualization requirements and limited resources shape the design of parallel rendering algorithms. In the last lecture, Chris Johnson presents a proof of concept of a long-standing ambitious idea -- computational steering. His lecture centers on a problem-solving environment called SCIRun. SCIRun is a scientific programming environment that allows the interactive construction, debugging, steering, and visualization of large-scale scientific computations. SCIRun can be envisioned as a ``computational workbench,'' in which a scientist designs and modifies simulations interactively via a dataflow programming model. SCIRun enables scientists to modify geometric models, interactively change numerical parameters and boundary conditions, and adjust the level of mesh adaptation needed for an accurate numerical solution. As opposed to the typical ``off-line'' simulation mode -- in which the scientist manually sets input parameters, computes results, visualizes the results via a separate visualization package, then starts again at the beginning -- SCIRun ``closes the loop'' and allows interactive steering of the design, computation, and visualization phases of a simulation.
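
The essence of steering is that the simulation picks up parameter changes between time steps instead of running to completion with fixed inputs. The following minimal sketch (a stand-in for SCIRun's dataflow interface, not its actual code) shares one hypothetical steerable parameter between a console front-end thread and a simulation loop.

  // Minimal steering-loop sketch: the solver polls a shared, steerable
  // parameter between time steps. Parameter, step count, and console
  // front end are hypothetical stand-ins for a real steering interface.
  #include <atomic>
  #include <chrono>
  #include <iostream>
  #include <thread>

  std::atomic<double> viscosity{1.0};   // hypothetical steerable parameter

  // Steering front end: a real system would expose a dataflow UI or a
  // network port; a console reader keeps the sketch self-contained.
  void steeringThread() {
      double v;
      while (std::cin >> v) viscosity = v;
  }

  int main() {
      std::thread ui(steeringThread);
      ui.detach();                      // sketch only: no clean shutdown

      for (int step = 0; step < 1000; ++step) {
          double nu = viscosity;        // pick up the latest steered value
          // ... advance the simulation one time step using nu ...
          // ... hand intermediate results to the visualization stage ...
          std::cout << "step " << step << " viscosity " << nu << "\n";
          std::this_thread::sleep_for(std::chrono::milliseconds(100));
      }
      return 0;
  }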

We propose a half-day course (3.5 hours) on the above topics, which we believe are the most relevant given recent technology advances and current demands from science and engineering applications. Each presenter will be given 45-55 minutes, enough to cover the background material the audience needs to explore each topic further through the collections of papers included in the course notes.

A concise syllabus with an estimated timeline follows:

08:30 - 08:35 Opening Remarks, Ma
08:35 - 09:30 Large data management for interactive visualization design, Cox
09:30 - 10:15 Adapting data-flow systems to large datasets, Schroeder
10:15 - 10:30 break
10:30 - 11:15 Parallel visualization systems, Ma
11:15 - 12:00 Visual computing and interactive steering, Johnson
12:00 - 12:10 Open Discussion