Example 192/199 Documents
Name:
Major: Computer Science
Course: ECS 199 Fall 2007 4 Units
Instructor: Michael Gertz
ACCESSING AND EXPLORING SENSOR DATA IN THE COMET DATA PORTAL
Objective
Scientific data gathered from sensors and satellites are a vital resource for applications in weather
forecasting, monitoring climate change, water management and irrigation scheduling. Providing uniform
and transparent access to the data collected from these instruments in an interoperable format is the
key to utilizing the data for scientific data analysis and exploration. This feature is also an important
functionality in the COMET Portal, which is being developed in the NSF funded COMET project (Coast to
Mountain Environmental Transect) and is designed to provide access to data in various formats so it
can be utilized by scientists in their analysis and exploration tasks. So the goal of this project was to
enhance the existing Portal by adding functionality that allows users to select view and explore sensor
data represented in a time-series format.
CIMIS Data
The California Irrigation Management Information System (CIMIS) manages a network of 197 weather
stations located throughout the state of California. This network was developed in 1982 by Department
of Water Resources and University of California, Davis to manage water resources efficiently. The
weather stations collect data to calculate hourly and daily values. At midnight, computers connect to
each station to retrieve the data it collected that day. The data is in then stored in a database which the
portal is using. Some variables that the station collects include solar radiation, soil temperature, air
temperature, vapor pressure, wind speed, precipitation, humidity, dew point and wind run.
Implementation Details
The station data was stored in a PostgresSQL Database with a row containing the station id, date and
the measured values. For this project I focused on the daily precipitation values. This value is a real value
with the possibility of being null. I started by writing and testing the necessary queries to extract the
values of interest. Although, the primary focus was on one variable, I designed the system general
enough to allow multiple variables to be selected with only minor changes.
Name:
Major: Computer Science and Engineering Major
Course: ECS 199, 3 units, Spring 2001
Instructor: Michael Gertz
MAINTAINING INTEGRITY CONSTRAINTS ON XML DATA IN ORACLE 8i
As more and more public and private businesses use XML technologies to share data over the Internet, it becomes more important that XML provide the necessary integrity of that data. One way to ensure that the data makes sense is to impose and enforce integrity constraints on the data. These integrity constraints should have a meaning similar to their relational database counterparts. In a relational database management system, such as Oracle 8i, data is stored in tables with primary and foreign key constraints, as well as domain and cardinality constraints that ensure that the data makes sense and represents the real world as closely as possible. Because of the document-based structure of XML data, these constraints are not straightforward to implement and their meaning is not readily apparent.
My research for this quarter has focused on the meaning of some important integrity constraints that can be enforced on XML data, specifically by Oracle 8i. Before exploring these constraints, however, I first focused on the way XML data is mapped to a relational database. I then focused on some general constraints, such as primary and foreign keys. Lastly I examined some more interesting constraints that may be necessary because of the document-based presentation of XML data. In this report, I will describe my findings and finish with some ideas for future research.
The Mapping
The XML DTD (Document Type Definition) is usually a separate document that
contains a syntax for the data it represents. The DTD describes how the
data will be presented, which tags will be displayed in which order, and
what the content of those tags will be [1] (Other mappings are suggested
in [5] and [6]). The mapping of an XML document to a relational database
depends entirely on the DTD to which the document conforms. A DTD consists
entirely of Elements, Attributes and PCDATA (or text). Such a simple structure
should result in a simple algorithm for mapping the data stored in an XML
document to a relational database. The general approach will be to consider
each Element as an Object (in the Object-Oriented sense) and then use the
Object-Relational model to map the objects to tables in the database. This
idea comes from the XML-DBMS program written by Ronald Bourret [2]. The
mapping is based on the following rules: First, an Element may contain other
Elements, Attributes and text. Each such Element is considered as an Object
and so a table is created in the database with an artificial primary key
column that each sub-Element can reference. Some Elements may contain both
Elements and text. In this case, a separate table is created for all of
the text elements, with a foreign key column to reference its parent Element.
In Figure 1, the Element ChordML contains a series of other Elements and
Attributes, so there is a table, ChordML, which contains columns for each
of the Attributes and those sub-Elements that consist only of text. For
the (lyrics, sugend?)+ part there are separate tables, since these sub-Elements
are not text only. Note that there is also a ChordMLPK column that contains
the primary key, and there are also some columns that will keep track of
the order in which the sub-Elements appear in the XML document, since order
may be important to the meaning of the data. The lyrics element in Figure
1 is an example of an Element that contains both sub-Elements and PCDATA.
Within a <lyrics> tag, there may be a series of strings intermixed
with <a> elements, which contains more text and some attributes. The
lyrics Element is mapped to two tables in the database: lyrics and lyricsPCDATA.
These two tables keep the metadata (the information about which lyrics come
in what order and under which ChordML object) and the text (PCDATA) separate.
Next, the columns of the above mentioned tables are determined in the following way: for each single-valued Attribute, or for each singly occurring Element that contains only PCDATA, create a column in the table. Each Attribute of the ChordML Element (title, version, author, translator, idiome) is mapped to a single column in the ChordML table. Also, the copyright and textref sub-Elements, each of which can occur only once per ChordML Element, also get a column in the ChordML table. Since the textref sub-Element has the =91?=92 character accompanying it, this means that its presence is optional. Therefore, that column in the table should be allowed to be NULL.
Next in the mapping of a DTD is to consider those Attributes and sub-Elements that are multiply occurring (such as those with a + or *). These Attributes and sub-Elements get their own table in the database. An example of this from Figure 1 is the lyrics Element as mentioned above.
Another example is in the DTD in Figure 3. The sub-Element Dim within the Array Element is mapped as its own table. The Dim table contains a foreign key column to reference which Array Element it falls under. The last consideration for mapping Elements of a DTD to tables in the database is to consider Elements or Attributes that have the same name as other Elements or Attributes in the same DTD.
Consider the DTD fragment from Figure 5. The Company and Contact Elements both have a multiply occurring Element named Email. When parsing the DTD to determine how to build tables, we first come upon the Company Element and find that Email is a multiply occurring sub-Element, so we create a table Email with a foreign key column to the Company table. Then we come to the Contact Element and again see the Email is a multiply occurring sub-Element. We cannot create a new table called Email because one already exists. The XML-DBMS program does not support such a DTD at this time [3]. One possible solution is to append the parent Element's name to the table to get unique table names: Email_Company and Email_Contact. This change would have to be noted before parsing any XML document so that when it comes across the <Company> or <Contact> Elements and the subsequent <Email> tags it can note which parent the Email belongs to. Another solution is to have two foreign key columns in the Email table: one for Company, one for Contact. The two columns would have to be mutually exclusive however: for each tuple in the Email table only the CompanyFK OR the ContactFK column could have a value, never both.
Primary Keys
The concept of a primary key in a relational database is that a key is associated
with each table such that the column(s) that make up the primary key always
have unique values. As mentioned in the mapping of Elements and Attributes
in a DTD to tables in a database, an artificial primary key column is generated
and presumably is filled with some counter that is incremented when a row
is entered into the table. But this says little or nothing about the data
itself. In the DTD in Figure 1, for example, a primary key might consist
of the title, author and version Attributes together. A parsing program
is unlikely to figure this out for itself, and that=92s why a simple, artificial
primary key column is used. Currently there is no way to express this kind
of primary key in a DTD. The only feature that comes close is the ID Attribute
type (see [1]). Any attribute that is specified as an ID must have unique
values throughout the entire XML document. Suppose, however, that there
is a series of documents that contain data that conforms to a given DTD.
Within each of the documents the ID fields are unique, but across all documents
there are duplicates. This causes a problem when entering data into the
database.
Also, only one Attribute per Element can be an ID. This would not be appropriate for the DTD of Figure 1, where we would like to have three Attributes combined to form unique values. Some Elements are mapped to tables but have no Attributes. One such Element is the sugend Element of Figure 1. It might make sense that the column that contains the PCDATA for this Element be a primary key, but since it is the body of the Element and not an associated Attribute, there is no way to express this in the DTD. If there were a way to show which Attributes (or sub-Elements) in a DTD make up a primary key for that Element, then we would have to figure out how to enforce such a constraint. One way to do this would be to parse the DTD and find those Attributes and sub-Elements which are marked as part of a primary key and, when creating the table for the database, mark the associated columns as primary key columns, which must be not null. Then, when XML documents are parsed and data is entered in the database from the document, the database itself will check that the integrity constraint is maintained.
A problem can occur with this solution. Suppose that instead of the ChordML Element containing its many Attributes there were another sub-Element named Info as in Figure 6. The sub-Element info would be mapped to a separate table with columns for each of its sub-Elements. In this case, parts of the Info Element make up the primary key for the ChordML Element, but there isn=92t even a column for that information in the ChordML table. One possible, albeit very difficult, solution might be to consider the Info table to be the parent table with the appropriate primary key (title, author, version). Then the ChordML row associated with that Info row would have a foreign key column(s) reference to the Info table. This would be extremely difficult to program into a DTD parser since it really would be out of order from the presentation in the DTD. Another potential problem concerning primary keys is keys that extend across a series of tables. For instance, consider the DTD fragment of Figure 6, but the title Element occurs as a sub-Element of the ChordML Element instead of the Info Element. Then our primary key consists of columns in both the ChordML table and the Info table. A possible solution to this problem is to create a table for metadata. This table might consist of the primary key columns (title, author and version) and an additional column, which the two sub-tables (ChordML and Info) can reference. The primary key issue is one that can be solved, but a uniform solution may still need to be researched.
Foreign Keys
Foreign key constraints are even more interesting than primary keys. The
foreign keys make no sense without primary keys, so it is important that
the primary key issues are handled before discussing foreign keys. Assuming
that the primary keys for a DTD can be determined according to one of the
solutions discussed in the previous section, then foreign keys are the next
issue to consider.
The foreign key is used to identify relationships among tables in a relational database. In an XML document, they should be used in a similar way, to represent child Elements and its relationship with a parent Element. According to the mapping discussed above, tables that represent sub-Elements contain a foreign key column that references the primary key column of its parent Element. This use of a foreign key is consistent with the meaning of a foreign key in a relational database. However, these foreign keys are, again, artificial, and don=92t necessarily represent any real world data. Currently, the only method for identifying a foreign key in a DTD is by using the IDREF(s) Attribute type [4]. This must always be used in conjunction with the ID Attribute type mentioned above in relation to primary keys. The IDREF Attribute value must match the value of an ID Attribute somewhere else in the XML document. This may suffer from the same problems as the ID Attribute type, specifically in the situation that the data is spread across several different XML documents. If the IDREF refers to an ID Attribute value in a separate document, it may not be recognized as being related to that Element.
Aside from the ID/IDREF(s) convention, a foreign key can still exist. As mentioned, it serves to illustrate a relationship between two tables in a database. In XML, a foreign key can be thought of as representing the relationship between two Elements. If there is a 1:1 relationship between two Elements, then the one can be represented as a column in the table of the other (this is the case of a single valued Element in the mapping of a DTD). If instead there is an N:1 relationship (a multiply occurring sub-Element or multi-valued Attribute) then a separate table is created with a foreign key to the other Element. The real question is how this is enforced in the database. Consider again, the situation where a primary key is spread across more than one table. There is no way in Oracle to specify a foreign key that references multiple tables. It clearly would not work to simply choose one of the many tables that contains the primary key and reference that one, since that chosen table may not always have a unique value for the chosen column. If, however, the solution of creating a metadata table that contains the primary key columns and an index column, then the sub-Elements can reference that index column in the metadata table. This might present a problem of its own, however, since it now may be difficult to determine which tables that reference the metadata table are the tables that contain the primary key (Info and ChordML), or some sub-Element (like lyrics). In this case, it may be impossible to determine whether lyrics is a sub-Element of the Info Element or of the ChordML Element.
Critical Operations
When it comes to enforcing integrity constraints, one must identify the
critical operations that could result in a violation of data integrity.
Operations performed on an XML document can have a profound impact on the
data already stored. First, consider insertions into the document. Insertions
into a database usually concern only the new data being inserted in terms
of the integrity constraints that could be violated. Insertions into an
XML document, however, can be more complex. An XML document is a tree of
nodes with child nodes (sub-Elements) referring to the parent nodes and
ending in leaf nodes (PCDATA). If a new leaf node is inserted, this usually
does not upset the tree as a whole and, so long as it conforms to the DTD,
will not cause a constraint violation. Insertion of a node into the middle
of the tree, however, can be quite damaging. The new node will be a child
of some other node and the reference will have to be checked. Suppose we
have the sample XML document in Figure 9 that conforms to the Product DTD
of Figure 8. Each Link Element in the document is a leaf node. Adding an
additional Link Element under, say, the first Item Element is perfectly
fine and causes no trouble. If we add another Para Element before the end
of the Description, this node will go into the middle of the document. Suppose
that we want all Links in the document to be unique. Then, if an additional
link were added to the new Para Element, we would somehow have to check
that it was different from all other Link Elements already in the document.
Lastly, inserting a new top-level node also could cause problems. If we
add a new Product Element to the document of Figure 9 and we want all product
names to be unique, the new Product node will have to have its Name Element
checked to ensure that it is different from XML-DBMS, the other product
name already in use. Also suppose that the Link Elements must be unique
as before, except only in reference to their particular product. So, for
instance, a new Product could contain a Link to xmldbms.zip as well. Then
we would need to keep track in the Link table of which Product it comes
from (not only which Item or Para Element it comes from as would be done
now). Then, when a new Product node is added to the end of the document
the Link Elements contained within will have to be unique, but only across
those links in the new Product. If the primary key for the Link table previously
consisted of the URL, it now must consist of the URL AND the Product under
which it will be displayed.
Almost any update operation on a database table is a critical operation for integrity constraints. The same is true of an update operation on an XML document, several of which are defined in [7]. Changing anything about an XML document could cause a violation of integrity constraints. Insertions and deletions are considered separately. Updating an XML document then means to change the values or order of the existing Elements, Attributes and text in the document. Let=92s again consider the document of Figure 9. Suppose we rearrange the order of the Link Elements. This should not cause any violation, except that the order columns in the Link table will need to be updated with the change. Suppose, instead, that we have the situation previously described where a new Product node is added with its own Links and we want all Links within each Product to be unique. Now, if we take all Link elements in the document (from both Product nodes) and rearrange them, we may have an integrity constraint violation. If one of the Links from the first Product matched one of the Links from the second Product this would have been considered acceptable. But if, after rearranging all of the Link Elements, these two Links end up in the same Product, all Links are no longer unique within each Product. This is only one example of how updating an XML document can violate integrity constraints.
Lastly, we consider how deleting nodes and text from an XML document may affect integrity constraints (again, see [7]). This is perhaps the most interesting of the potential critical operations. As with insertions, deletion of a leaf node is typically not a critical operation. For instance, if we later removed the second Product node and returned the document to the state of Figure 9, no constraints could be violated. But suppose that we have an additional Product Element as before and contained within this second Product Element is a reference somewhere to the first Product Element. Suppose the Description Element contains an ID Attribute because some products have the same description and there is no need to repeat the same information over again. Then suppose the second Product contains the same Description as the first Product so there is an IDREF. If there is an attempt made to delete the first Product node, then the second Product will have no Description. One solution is to require cascading deletes. When a node that is referenced is deleted, all nodes that reference the deleted node must also be deleted. In this example, deleting the first Product node would require the entire document be deleted. Another solution is to change all references to the deleted node to a NULL element. This may or may not be acceptable if the referencing Element is optional. A third solution is to simply not allow referenced nodes to be deleted. Another situation in which deletion causes a constraint violation is when we wish to delete only the List node (in Figure 9) but keep the Item and Link Elements below. First, the Item Elements would now be misplaced in the document and the document would no longer conform to a DTD. But even if we wanted to keep the Links intact, we could do so by moving them to the Para Element above. The subsequent references in the database that keep track of which Para or Item Element a Link belongs to would have to be updated to avoid a constraint violation.
Other Constraints
There are many other types of integrity constraints that can be imposed
on data. Some constraints such as domain and range constraints and cardinality
constraints are considered in XML Schemas ([8] and [9]), an alternative
to DTDs in describing XML documents. Other, more interesting constraints
are constraints that are specific to XML because of its tree-like structure.
These constraints are usually easy to describe in plain English, but very
difficult to enforce and almost impossible to describe in a DTD. These are
constraints where the data itself can determine the structure of the XML
document. I will give three examples. The first example may be a trivial
one. Say we have an XML document that contains a catalog for a large e-business
that sells a variety of goods. Suppose this document has a node for each
item or product carried by the company. The first Element could be the type
of product, i.e. book, CD, Video Tape. Then this would require that the
structure that follows would depend on that value. For a book, we might
need the author(s), the type of book (fiction, non-fiction, biography, etc.)
the length in pages, physical dimensions, and perhaps the name and dimensions
of the image file that has a picture of the cover. Then for a CD we would
need artist(s), running time, type of music (jazz, rock, rap, etc.), and
maybe even links to short audio clips of various tracks. Of course, this
could be considered to be bad design, since each type of item really should
be a different Element. Then the DTD will simply be able to specify the
proper structure that constitutes a <BOOK> or <CD> Element.
The second is perhaps a more interesting example. Say we have an XML document that contains information about various commercial airline flights. Suppose one Attribute for a given flight is the distance traveled. If this value is greater than 1,000, for instance, then that means that a meal will be served on the flight. If that's the case, then depending on the distance flown, there would have to be an additional Element that references some <MENU> element in a different part of the document. The third, and perhaps most convincing example of how data can determine the structure of an XML document is as follows. Say we have an XML document that contains Tax return information for a number of people. At the beginning of each <RETURN> element is an element <FILING STATUS> that can be joint married, single, dependent, etc. Then, obviously a great deal of different information is required depending on the type of return. This is also just for personal income tax returns, to say nothing of how this might affect businesses and corporations who file tax returns in a very different way. Also, since some states already allow people to submit tax returns online, this is a rather important issue to understand.
Conclusions and Future Work
The topic of describing integrity constraints in XML and enforcing them is one of great magnitude. Unfortunately, in only 10 weeks of study I have barely scratched the surface. There is a wealth of very important research yet to be done regarding this issue. So far, I have described many of the possible constraints and some of the ways in which they can be violated. I have not, however, learned much about a variety of possible solutions to these problems. Much of my research is based on DTDs, which are very unexpressive. A way to expand the expressiveness of a DTD should be researched so that DTD parsers can recognize some of these constraints. The constraints I have discussed and researched are only a small sample of the innumerable possible constraints for the innumerable possible XML documents.
What I have learned in my research is that it is essential that these integrity constraints be understood and that database management systems incorporate XML parsers that can enforce these constraints. If businesses and organizations want to be able to share their data using XML (which is an excellent way to share data), they must be made aware of the potential for inaccurate data that exists today. Before data that is transferred via XML may be trusted for mission- or life-critical operations (such as medical data for hospital patients) users must be aware of the fact that the integrity of that data is in no way guaranteed. This topic can be studied in much greater detail and must be so for XML to grow and mature as a useful medium of information exchange.
Appendix A: Figures
<!ELEMENT ChordML (copyright, textref? , (lyrics, sugend?)+)><!ATTLIST ChordML
title CDATA #REQUIRED
version CDATA #IMPLIED
author CDATA #IMPLIED
translator CDATA #IMPLIED
idiome CDATA #IMPLIED>
<!ELEMENT copyright (#PCDATA) >
<!ELEMENT textref (#PCDATA) >
<!ELEMENT a (#PCDATA) >
<!ATTLIST a
c CDATA #REQUIRED<!ELEMENT lyrics (#PCDATA|a)*>
al CDATA #IMPLIED > <!-- alternative -->
<!ELEMENT sugend (#PCDATA) >
Figure 1: ChordML DTD [10]
CREATE TABLE a
(c VARCHAR(255),
al VARCHAR(255),
aPCDATA VARCHAR(255),
lyricsFK INTEGER,
aPCDATAOrder INTEGER,
aOrder INTEGER)
CREATE TABLE lyrics
(ChordMLFK INTEGER,
lyricsOrder INTEGER,
lyricsPK INTEGER)
CREATE TABLE ChordML
(copyrightOrder INTEGER,
ChordMLOrder INTEGER,
textref VARCHAR(255),
version VARCHAR(255),
copyright VARCHAR(255),
translator VARCHAR(255),
author VARCHAR(255),
title VARCHAR(255),
textrefOrder INTEGER,
idiome VARCHAR(255),
ChordMLPK INTEGER)
CREATE TABLE lyricsPCDATA
(lyricsPCDATA VARCHAR(255),
lyricsPCDATAOrder INTEGER,
lyricsFK INTEGER)
CREATE TABLE sugend
(ChordMLFK INTEGER,
sugend VARCHAR(255),
sugendOrder INTEGER)
Figure 2: ChordML create table statements
<!ELEMENT Company (=85, Email*, =85)>
=85
=85
<!ELEMENT Contact (=85, Email*, =85)>
=85
<!ELEMENT Email (#PCDATA)>
Figure 3: Example of different Elements with the same name
<!ELEMENT ChordML (Info, copyright, textref? , (lyrics, sugend?)+)>
<!ELEMENT Info (title, version, author, translator, idiome)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT version (#PCDATA)>
<!ELEMENT author (#PCDATA)>
<!ELEMENT translator(#PCDATA)>
<!ELEMENT idiome (#PCDATA)>
=85
Figure 4: Modification to ChordML DTD
<!ELEMENT Product (Name, Developer?, Summary?, Description?)>
<!ELEMENT Name (#PCDATA)>
<!ELEMENT Developer (#PCDATA)>
<!ELEMENT Summary (#PCDATA)>
<!ELEMENT Description (Para | List)+>
<!ELEMENT Para (#PCDATA | Link)*>
<!ELEMENT List (Item+)>
<!ELEMENT Item (#PCDATA | Link)*>
<!ELEMENT Link (#PCDATA)>
<!ATTLIST Link
URL CDATA #REQUIRED>
Figure 5: Product DTD [11]
<Product>
<Name>XML-DBMS</Name>
<Developer>Ronald Bourret, Technical University of Darmstadt</Developer>
<Summary>Java packages for transferring data between
XML documents and relational databases</Summary>
<Description><Para>XML-DBMS is a set of Java packages for transferring data
between XML documents and relational databases. It views the XML document
as a tree of objects in which element types are generally viewed as classes
and attributes and PCDATA as properties of those</Para>
<Para>You can:</Para>
<List><Item>
<Link URL=3D"readme.html">Read more about XML-DBMS</Link>
</Item>
<Item><Link URL=3D"XMLAndDatabases.htm">Read more about databases and XML</Link>
</Item>
<Item>
<Link URL=3D"xmldbms.dtd">View the mapping language DTD</Link>
</Item>
<Item>
<Link URL=3D"xmldbms.zip">Download XML-DBMS</Link>
</Item>
</List>
<Para>XML-DBMS, along with its source code, is freely available for
use in both commercial and non-commercial settings.</Para></Description>
</Product>
Figure 6: Example Product XML Document [12]
References
[1] Bray, Tim. Paoli, Jean. Sperberg-McQueen, C.M. Maler, Eve. (ed.), Extensible Markup Language (XML) 1.0 (Second Edition), W3C, October 6, 2000. (http://www.w3.org/TR/REC-xml).
[2] Bourret, Ronald. Mapping DTDs to Databases. XML.com. May 9, 2001. (http://www.xml.com/pub/a/2001/05/09/dtdtodbs.html).
[3] Bourret, Ronald. Re: XML-DBMS can not handle some DTDs? Yahoo! Groups: xml-dbms Messages. April 17, 2001. (http://groups.yahoo.com/group/xml-dbms/message/992).
[4] Fan, Wenfei. Simeon, Jerome. Integrity Constraints for XML. Pg 2-5.
[5] Kappel, Gerti. Kapsammer, Elisabeth. Retschitzegger, Werner. X-Ray Towards Integrating XML and Relational Database Systems. July 2000.
[6] Zhang, Xin. Mitchell, Gail. Lee, Wang-Chien. Rundensteiner, Elke A. Clock: Synchronizing Internal Relational Storage with External XML Documents. 2001.
[7] Tatarinov, Igor. Ives, Zachary G. Halevy, Alon Y. Weld, Daniel S. Updating XML.
[8] Bonifati, Angela. Ceri, Stefano. Comparative Analysis of Five XML Query Languages.
[9] Fallside, David C. XML Schema Part 0: Primer. W3C. March 16, 2001.
[10] Frederico, Gustavo. ChordML DTD, XML.org. January 14, 1999. (http://www.cifranet.org/xml/chordml.dtd).
[11] A sample included with the XML-DBMS program.
[12] Bourret, Ronald. XML-DBMS: Version 1.01. (http://www.rpbourret.com/xmldbms/readme.htm).