Lecture: 3 hours

Discussion: 1 hour

Prerequisite: Course 165A

Grading: Letter; project (50%), presentation (30%), homework (20%)

Catalog Description:
Scientific data integration, metadata, knowledge representation, ontologies, scientific workflow design and management. Offered in alternate years.

Expanded Course Description:

  1. Introduction to scientific data management: goals and challenges
  2. Scientific data models, transformations
    1. Generic data exchange formats (XML)
    2. Specialized data/file formats (netCDF, HDF5, FASTA, Nexus)
    3. Tree-based data transformations (XPath, XQuery, XSLT)
    4. XML model and query/transformation languages (XSLT, XQuery)
      1. Database integration, query rewriting
  3. Knowledge representation with ontologies
    1. From controlled vocabularies, taxonomies, to description logic ontologies
    2. Reasoning with ontologies
  4. Data integration
    1. Schema-mapping-based approaches: Global-as-View (GAV), Local-as-View (LAV); extensions
    2. Ontology-based extensions for data integration
  5. Scientific workflows
    1. Introduction/motivation: capturing in silico experiments as scientific workflows
    2. Application examples from diverse domains (e.g., bioinformatics, ecoinformatics, particle physics)
    3. Formal models for scientific workflows: Petri nets, Kahn process networks, Synchronous Dataflow
    4. Scientific workflow design paradigms: Collection-Oriented Modeling & Design (COMAD), higher order/functional programming patterns
    5. Data and workflow provenance models

A selection of technical papers addressing specific topics will be used. No textbook is required.


There are two kinds of projects: implementation projects and research projects. In implementation projects, students work with Java-based open-source systems such as the Kepler workflow system to design and implement example workflows, e.g., a bioinformatics workflow that connects several “bio web services”. Students thus work with existing software systems, but they will typically also implement project-specific extensions to that software.

For research projects, students will read one or more research papers from a list of offered research topics (e.g., scientific data integration, ontologies and knowledge representation in scientific data management, scientific workflows). Students will then apply the results of the research papers to a specific problem (e.g., applying a certain query rewriting algorithm to a given integration scenario and set of queries). In general, the deliverable of a research project is a technical report that summarizes and compares the results of the studied papers and their application to the given problem. Depending on the topic, the presented algorithms might have to be implemented and applied to the given problem instance.

Computer Usage:
For the Implementation Projects (IPs), students will primarily use and extend the Java-based Kepler workflow system, which is available under Linux, Windows and MacOS. Computer usage is not required for homework.


The course introduces data modeling, data integration, knowledge representation, and scientific workflow challenges and techniques with a focus on scientific applications. Advanced topics include: ontologies as formal metadata, reasoning with ontologies in description logics, semantic query rewriting/optimization using ontologies, models of computation and provenance for scientific workflows.

Instructors: B. Ludaescher, M. Gertz

Prepared by: B. Ludaescher (September 2007)

Overlap Statement:
There is no significant overlap with any other course.