No Title

On page 451, the proof of Theorem 17.1.1 presents an algorithm for building an ultrametric tree. The algorithm is correct, however Theorem 17.1.3 states that the algorithm can be implemented in O(n²) time. In fact, I don't see how to do that, although O(n²log n) is an easy bound for the algorithm. Rather than continuing to try to find a clever implementation of that algorithm, here is another combinatorial algorithm that I claim is correct and that does run in O(n²) time. The algorithm is described in terms of a graph G, based on matrix D, but it can be implemented without an explicit graph.

Let each row i of matrix D be represented by a node i in G, and each edge (i,j) be given the value D(i,j). In O(n²) time, the algorithm will find a very particular path in graph G:

set N equal to all the indices 1 through n; set L to the empty path; set i to any node.

repeat n-1 times: begin remove i from N; find an index j in N such that $D(i,j) \leq D(i,k)$ for any k in N. place edge (i,j) in the path L; set i to j; end;

What this produces is a path L of exactly n edges, and the algorithm can be implemented in O(n²) time. It turns out that L is a minimum spanning tree of G, but that fact is not needed.

We will now use L to create the ultrametric tree recursively.

Concentrate on an edge (p,q) in the path L with the largest edge weight of all edges in L, and let P be the set of nodes at or to the left of p in L, and let Q be the set of nodes at or to the right of q in L. The fact that D is an ultrametric matrix implies that for any pair of nodes (i,j) where i is in P and j is in Q, D(i,j) = D(p,q). One way to prove this is by induction on the number of edges between i and j in L (applying the ultrametric condition that the in any triangle, the max of the three edge weights is not unique). What this means is that in the ultrametric tree we are building (and in any ultrametric tree for D), any pair of leaves (i,j) where i is in P and j is in Q must have their least common ancestor at the root of the ultrametric tree, and that root must be labelled D(p,q).

If there are k > 1 ties for the global max edge weight in L, then removing those k edges creates k+1 subpaths of nodes, and applying the above argument, any two nodes i and j which are in different subpaths must have their least common ancestor at the root of the tree, which again must be labeled D(p,q). Hence, any ultrametric tree T for D must have exactly k+1 edges out of D, and the leaf set below any such edge must be exactly the (distinct) set of nodes in one of the k+1 subpaths.

No matter what k is, removing the k max weight edges in L, and partitioning N, takes only O(n) time.

To continue the description of the algorithm, we assume for convenience that k = 1. Let LP and LQ denote the two subpaths created by removing the max weight edge in L. Now we want to find an ultrametric tree for set P and one for set Q; these two ultrametric trees will then be attached to the root to creat the full ultrametric tree for D. But note that we already have the needed paths LP and LQ that would be created if we were to recursively apply the above method (clearly LP could result if we applied the path building algorithm to P alone, and similarly for LQ and Q). So we only need to find the max weight edge(s) in LP and the max weight edge(s) in LQ. Those two edges can be found in O(n) total time. Again, because the nodes were partitioned in the first step, this time bound holds even for k > 1.

Continuing, we build the ultrametric tree in O(n²) total time.

Note that at each step of the algorithm, the node partitions that are created, and the associated edges that are put into T, are forced. Hence if D is an ultrametric matrix, the ultrametric tree T for D is unique.

The algorithm presented on p. 147 for finding all the maximal pairs in a string in linear time was first discovered and published by Brenda S. Baker in

1. Brenda S. Baker, A program for identifying duplicated code, Computing Science and Statistics 24 (1992), Interface Foundation of North America, pp. 49-57].

The algorithm can additionaly be found in:
2. Brenda S. Baker, A Theory of Parameterized Pattern Matching: Algorithms and Applications (Extended Abstract), Proc. 25th ACM Symposium on Theory of Computing, May, 1993, pp. 71-80.

3. Brenda S. Baker, Parameterized Duplication in Strings: Algorithms and an Application to Software Maintenance, SICOMP 26,5 (Oct. 1997), pp. 1343-1362.

The algorithm has also been patented in U.S. Patent #5,953,006, filed on March 18, 1992 and issued on Sept. 14, 1999.

About this document ...

Next: About this document ...

Dan Gusfield
1998-09-29