Algorithms on Graphs

William D. Shoaff
Florida Institute of Technology

Many (most, perhaps all) interesting problems in computer science can be formulated (and sometimes solved) in terms of graphs. A graph G is a pair (V, E) where V is a finite set of vertices (or nodes) and E is a collection of edges between vertices. Edges may be directed or undirected (distinguished by ordered-pair notation (a,b) or set notation $\{a,b\}$). Often edges are weighted or otherwise labelled; nodes can also store state information. Special types of graphs, such as trees, directed acyclic graphs, and bipartite graphs, are important in some applications. The problems we want to solve most often involve constructing a graph with some property or determining whether a graph has some property. The graph questions we explore have solutions that are efficient in time and space; many other interesting graph problems do not seem to.

Reachability

Reachability is a classic graph property that asks if node b can be reached from node a; there are many real-world applications of reachability.

A simple search algorithm solves an instance of reachability as follows. Throughout the algorithm a set of vertices, denoted by S, is maintained. Initially, $S=\{a\}$. Each node is either marked or unmarked; node i is marked if it is currently in S or has been in S at some point in the past. Initially, only a is marked. At each iteration, choose some node $i\in S$ and remove it from S. Then process each edge $(i,\,j)$ out of i: if j is unmarked, mark it and add it to S. Continue until S becomes empty. At this point, answer ``yes'' if b is marked and ``no'' otherwise.

The informal statements above can be written in pseudo-code.
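
As one possible rendering, here is a minimal C sketch of the search described above. The adjacency-matrix layout `adj`, the bound `MAX_NODES`, and the function name `reachable` are assumptions made for illustration; the set S is kept as a stack, although any removal order would do.

```c
#include <assert.h>

#define MAX_NODES 100   /* hypothetical bound on the number of nodes */

/* adj[i][j] != 0 means there is an edge (i, j); assumed representation */
static int adj[MAX_NODES][MAX_NODES];

/* Returns 1 if node b can be reached from node a in a graph on nodes 0..n-1. */
int reachable(int n, int a, int b)
{
    int marked[MAX_NODES] = {0};
    int S[MAX_NODES];           /* the set S, stored as a stack */
    int size = 0;
    int i, j;

    assert(n >= 1 && n <= MAX_NODES);
    assert(a >= 0 && a < n && b >= 0 && b < n);

    marked[a] = 1;              /* initially only a is marked ... */
    S[size++] = a;              /* ... and S = {a} */
    while (size > 0) {
        i = S[--size];          /* choose some i in S and remove it */
        for (j = 0; j < n; j++) {
            if (adj[i][j] && !marked[j]) {
                marked[j] = 1;  /* each node enters S at most once */
                S[size++] = j;
            }
        }
    }
    return marked[b];
}
```

Because a node is added to S only when it is first marked, S never holds more than n entries, and with this matrix representation the search takes on the order of n^2 steps.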

Minimal Spanning Trees

We will present Kruskal's algorithm for minimal spanning trees.
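
The following is a sketch of the usual formulation of Kruskal's algorithm in C, not a transcription of the pseudocode referenced in these notes: sort the edges by weight, then accept each edge whose endpoints lie in different components. The `edge` type, `MAX_VERTICES`, and the helper names are assumptions, and the component test uses a deliberately simple union-find with path halving.

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_VERTICES 1000   /* hypothetical bound */

typedef struct { int u, v; double w; } edge;

static int parent[MAX_VERTICES];   /* parent pointers of the disjoint-set forest */

/* Find the representative of x's component, halving the path on the way. */
static int find_root(int x)
{
    while (parent[x] != x) {
        parent[x] = parent[parent[x]];
        x = parent[x];
    }
    return x;
}

/* qsort comparison: order edges by increasing weight. */
static int by_weight(const void *p, const void *q)
{
    double a = ((const edge *)p)->w, b = ((const edge *)q)->w;
    return (a > b) - (a < b);
}

/* Sorts edges[0..m-1] in place, stores the chosen edges in tree[],
   and returns how many were chosen (n-1 if the graph is connected). */
int kruskal(int n, edge *edges, int m, edge *tree)
{
    int i, count = 0;

    assert(n >= 0 && n <= MAX_VERTICES && m >= 0);
    for (i = 0; i < n; i++)
        parent[i] = i;                      /* every vertex is its own component */
    qsort(edges, (size_t)m, sizeof(edge), by_weight);
    for (i = 0; i < m && count < n - 1; i++) {
        int ru = find_root(edges[i].u);
        int rv = find_root(edges[i].v);
        if (ru != rv) {                     /* edge joins two components: keep it */
            parent[ru] = rv;
            tree[count++] = edges[i];
        }
    }
    return count;
}
```

The sort dominates the running time, which is why the efficiency of the set operations discussed later in these notes matters mainly for graphs whose edges are already sorted.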

The pseudocode for Kruskal is at ../kruskal.html

Analysis of Kruskal's Algorithm

Find-Set and Union Operations on Sets

Union By Rank

Union(A, B)
{
    if Rank(A) > Rank(B) then
        S[B] := A;
    else
        S[A] := B;
        if Rank(A) = Rank(B) then
            Rank(B) := Rank(B) + 1;
}

Find-Set with Path Compression
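
A minimal C sketch of both operations follows, assuming elements are numbered 0..n-1. The array `parent` plays the role of S in the Union pseudocode above; the names `make_sets`, `find_set`, `union_sets`, and `MAX_ELEMS` are illustrative assumptions. Path compression is done in two passes: first locate the root, then point every node on the path directly at it.

```c
#include <assert.h>

#define MAX_ELEMS 1000   /* hypothetical bound */

static int parent[MAX_ELEMS];   /* parent[x] == x means x is a root */
static int rnk[MAX_ELEMS];      /* rank: an upper bound on the height of x's tree */

/* Put each of 0..n-1 into its own singleton set. */
void make_sets(int n)
{
    int i;
    assert(n >= 0 && n <= MAX_ELEMS);
    for (i = 0; i < n; i++) {
        parent[i] = i;
        rnk[i] = 0;
    }
}

/* Find-Set with path compression. */
int find_set(int x)
{
    int root = x, next;
    while (parent[root] != root)        /* pass 1: walk up to the root */
        root = parent[root];
    while (parent[x] != root) {         /* pass 2: re-point the path at the root */
        next = parent[x];
        parent[x] = root;
        x = next;
    }
    return root;
}

/* Union by rank: the root of smaller rank becomes a child of the other. */
void union_sets(int a, int b)
{
    a = find_set(a);
    b = find_set(b);
    if (a == b)
        return;                         /* already in the same set */
    if (rnk[a] > rnk[b]) {
        parent[b] = a;
    } else {
        parent[a] = b;
        if (rnk[a] == rnk[b])
            rnk[b]++;
    }
}
```

Unlike the pseudocode, `union_sets` first replaces its arguments by their set representatives, so it may be called with any two elements rather than only with roots.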

Ackermann's function and its inverse:

We'll start by examining repeated exponentials

\begin{displaymath}2^{2^{2^{\ddots^{2}}}}\end{displaymath}

Specifically, define

\begin{displaymath}g(0) = 2,\, g(1) = 2^2,\, g(i) = 2^{g(i-1)}\end{displaymath}

For example, $g(2) = 2^{g(1)} = 2^{2^2} = 2^4 = 16$, $g(3) = 2^{g(2)} = 2^{16} = 65536$, and $g(4) = 2^{g(3)} = 2^{65536}$. Clearly the function g grows very quickly.

Now consider the iterated logarithm: $\lg^{(0)}n = n$, $\lg^{(1)}n = \lg n$, and $\lg^{(i)}n = \lg(\lg^{(i-1)}n)$ if i>0 and $\lg^{(i-1)}n > 0$. Define $\lg^{*}n = \min\{i\geq 0 : \lg^{(i)}n \leq 1\}$. Note that

\begin{displaymath}\lg^{*} g(n) = n + 1\end{displaymath}
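
To make the identity concrete, here is a small C sketch of $\lg^{*}$ for 64-bit integers. It avoids floating point by using an equivalent formulation, which is an assumption of this sketch rather than something stated in the notes: $\lg^{(i)}n \leq 1$ holds exactly when n is at most a tower of i twos, so $\lg^{*}n$ is the height of the smallest such tower that is at least n. The name `lg_star` is illustrative.

```c
#include <assert.h>

/* Iterated logarithm of n >= 1: the least i with lg^(i) n <= 1.
   tower holds a stack of i twos (1, 2, 4, 16, 65536, ...). */
int lg_star(unsigned long long n)
{
    unsigned long long tower = 1;   /* the empty tower, i = 0 */
    int i = 0;

    assert(n >= 1);
    while (n > tower) {
        i++;
        if (tower >= 64)
            break;                  /* next tower exceeds 64 bits, so n fits under it */
        tower = 1ULL << tower;      /* next tower is 2^tower */
    }
    return i;
}
```

Since g(n) is a tower of n+1 twos, the loop stops after exactly n+1 steps on input g(n); for instance g(2) = 16 gives 3 and g(3) = 65536 gives 4.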

Define the Ackermann function for integers $n,\,m \geq 1$ by the following rules (the second applies for $n \geq 2$, the third for $n,\,m \geq 2$):

\begin{eqnarray*}A(1,\,m) & = & 2^m \\
A(n,\,1) & = & A(n-1,\,2) \\
A(n,\,m) & = & A(n-1,\,A(n,\,m-1)) \\
\end{eqnarray*}
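
The recurrence can be evaluated directly for very small arguments. The C sketch below is only an illustration: the name `ackermann` and the convention of returning 0 when the value does not fit in 64 bits are assumptions of this sketch.

```c
#include <assert.h>

/* A(n, m) by the three rules above, for n, m >= 1.
   Returns 0 when the value would not fit in an unsigned 64-bit integer. */
unsigned long long ackermann(unsigned n, unsigned long long m)
{
    unsigned long long inner;

    assert(n >= 1 && m >= 1);
    if (n == 1)
        return m < 64 ? 1ULL << m : 0;  /* A(1, m) = 2^m */
    if (m > 64)
        return 0;                       /* A(2, 4) = 2^65536 already overflows */
    if (m == 1)
        return ackermann(n - 1, 2);     /* A(n, 1) = A(n-1, 2) */
    inner = ackermann(n, m - 1);
    if (inner == 0)
        return 0;                       /* propagate overflow */
    return ackermann(n - 1, inner);     /* A(n, m) = A(n-1, A(n, m-1)) */
}
```

With these rules A(2,1) = 4, A(2,2) = 16, A(2,3) = 65536, and A(3,1) = 16, while A(2,4) = $2^{65536}$ is already far beyond any machine word; the second row of A is the tower function g shifted by one.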


Graphs for Text Algorithms

Pattern matching automata

Finite state machines

Aho-Corasick

Dictionaries and Indexes

A dictionary or lexicon is a collection of all the words in a language, organized so that each word can be accessed quickly. It is also useful for lexicographers to be able to insert or delete dictionary words. An index is used to find all occurrences of a word in a text.

Suffix trees

Let a [[text]] be fixed. Our interest is to devise a data structure that efficiently represents every factor (substring) of [[text]]. Let [[text.factors()]] be a function returning this data structure, and call its type [[Factors]]. We would like to construct an object of type [[Factors]] in linear time and space, and we would like [[Factors]] to provide an O(|w|) answer to the question ``is w a factor of [[text]]?'' Other operations should also be efficient.
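
To illustrate the interface only, here is a C sketch of an uncompressed suffix trie. It does answer ``is w a factor?'' in O(|w|) steps, but it does not meet the linear time and space goal: inserting every suffix character by character takes quadratic time and space in the worst case, which is the cost a true suffix tree avoids. The names `factors_build`, `factors_contains`, `factors_free`, and the 256-way node layout are assumptions of this sketch.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define ALPHABET 256

typedef struct node { struct node *child[ALPHABET]; } node;
typedef struct { node *root; } Factors;

static node *new_node(void)
{
    node *p = malloc(sizeof *p);
    int c;
    assert(p != NULL);
    for (c = 0; c < ALPHABET; c++)
        p->child[c] = NULL;
    return p;
}

/* Build the trie by inserting every suffix of text (quadratic, not linear). */
Factors factors_build(const char *text)
{
    Factors f;
    size_t n = strlen(text), i, k;

    f.root = new_node();
    for (i = 0; i < n; i++) {
        node *p = f.root;
        for (k = i; k < n; k++) {
            unsigned char c = (unsigned char)text[k];
            if (p->child[c] == NULL)
                p->child[c] = new_node();
            p = p->child[c];
        }
    }
    return f;
}

/* Is w a factor of the text? Follows one edge per character of w. */
int factors_contains(const Factors *f, const char *w)
{
    const node *p = f->root;
    for (; *w != '\0'; w++) {
        p = p->child[(unsigned char)*w];
        if (p == NULL)
            return 0;
    }
    return 1;
}

/* Release a subtree; call as factors_free(f.root). */
void factors_free(node *p)
{
    int c;
    if (p == NULL)
        return;
    for (c = 0; c < ALPHABET; c++)
        factors_free(p->child[c]);
    free(p);
}
```

Every factor of the text is a prefix of some suffix, so walking down from the root along the characters of w succeeds exactly when w occurs in the text.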

Retrieval Trees

The Traveling Salesman Problem

Below, a heuristic algorithm for the traveling salesman problem is presented in C; you are to alter the code so that (1) it is still correct and (2) it runs twice as fast as the given code. To say the algorithm is heuristic means that it uses a ``good idea'' in attempting to solve the problem, but it may not always produce the ``correct'' answer.

The traveling salesman problem is a classic example of an NP-complete problem. NP problems are ones where, if you are given the answer, you can verify that it is correct quickly, although computing the answer may be very time consuming. A problem is NP-complete if it is in NP and it is as ``hard'' as any other problem in NP (we won't go into the technical definition here).

The traveling salesman decision problem asks: given a set of cities (represented as nodes in a graph), distances $d(c_i,\,c_j)$ between cities $c_i$ and $c_j$ (represented as weighted edges between pairs of nodes in the graph), and a maximum cost M, is there a tour of all the cities which costs M or less? That is, is there a simple cycle from $c_0$ to $c_0$ that passes through every other node in the graph exactly once such that the sum of the edge weights along the cycle is less than or equal to M?

No polynomial-time algorithm is known for the traveling salesman problem; the most direct exact method is to compute the cost of every tour and see if one has cost less than or equal to M. If there are n cities, then there are (n-1)! possible tours that start at the first city. (By Stirling's formula $n! \approx \sqrt{2\pi n} (n/e)^{n}$, so the number of tours to check grows very large very quickly.) Exploring all tours is called a brute force approach.
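
For a handful of cities the brute force approach can be written as a short recursion. The C sketch below enumerates all (n-1)! tours that start and end at city 0; the distance matrix `dist`, the bound `MAX_N`, and the function names are assumptions made for illustration.

```c
#include <assert.h>

#define MAX_N 10   /* hypothetical bound; (n-1)! tours makes larger n impractical */

static double dist[MAX_N][MAX_N];   /* dist[i][j] = d(c_i, c_j), filled in by the caller */

/* Extend a partial tour ending at city `last` in every possible way,
   recording in *best the cheapest complete tour seen (negative = none yet). */
static void extend(int n, int last, int count, int *visited,
                   double cost, double *best)
{
    int j;

    if (count == n) {                       /* all cities visited: close the cycle */
        double total = cost + dist[last][0];
        if (*best < 0.0 || total < *best)
            *best = total;
        return;
    }
    for (j = 1; j < n; j++) {
        if (!visited[j]) {
            visited[j] = 1;
            extend(n, j, count + 1, visited, cost + dist[last][j], best);
            visited[j] = 0;                 /* undo, then try the next city */
        }
    }
}

/* Decision version: is there a tour of all n cities with cost at most M? */
int tour_within(int n, double M)
{
    int visited[MAX_N] = {0};
    double best = -1.0;

    assert(n >= 1 && n <= MAX_N);
    visited[0] = 1;
    extend(n, 0, 1, visited, 0.0, &best);
    return best <= M;
}
```

The sketch assumes non-negative distances, since a negative value of `best` is used to mean that no tour has been completed yet.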

The heuristic we will use is called the ``nearest neighbor'' rule: starting from $c_0$, find the city closest to $c_0$ and travel to it, incrementing the ``cost'' of travel (which was initialized to 0). Then repeat the process from the city just reached, considering only the cities not yet visited. Once all the cities have been visited, return to city $c_0$. The nearest neighbor heuristic executes in $O(n^2)$ steps instead of the $O(n^n)$ steps needed for the brute force approach; thus, it is a polynomial time algorithm ($n^2$) and is tractable, while the brute force approach is exponential ($n^n$) and is intractable.

Code Optimization

Your job is to make the code at least twice as fast: if the code takes t seconds when you run it in your computer environment, then after your improvements it should run in t/2 seconds or less. Of course, your improved code must still execute the same algorithm, so it will still be of order $n^2$, but it must execute the algorithm faster.

A few remarks that should be obvious. First, it is not fair to change computer environments: if the code runs in t seconds on a PC and t/2 seconds on a Cray, you can't say you've optimized the code and fulfilled the requirement. Likewise, you can't change the compiler, or switch from un-optimized to optimized compilation, to fulfill the requirement.
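
One way to obtain t inside the program itself is the standard `clock()` function, which reports processor time rather than wall-clock time. The sketch below is an assumption about how one might measure, not a required method; `cpu_seconds` and `busy_work` are hypothetical names, and `busy_work` merely stands in for the tour computation.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical helper: run work() once and return the CPU seconds it used. */
double cpu_seconds(void (*work)(void))
{
    clock_t start, stop;

    assert(work != NULL);
    start = clock();
    work();
    stop = clock();
    return (double)(stop - start) / CLOCKS_PER_SEC;
}

/* Stand-in workload; in practice this would be the code being timed. */
void busy_work(void)
{
    volatile double sum = 0.0;
    long i;

    for (i = 0; i < 1000000L; i++)
        sum += (double)i;
}
```

Timing the original and the improved code with the same helper, on the same machine and with the same compiler options, keeps the comparison fair in the sense described above.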

And a few remarks about the code. You will need to create a data file that contains the number of cities and the $(x,\,y)$ coordinates of each city. I've set the maximum number of cities to 1000, but you need to use enough cities to be able to measure the time of the algorithm; that is, the initial running time t should be at least, say, 40 seconds. Also, the code starts the tour at the last city $c_{n-1}$ rather than the first city $c_0$ (this is not a significant change in the problem). The code simply prints out the tour; it does not determine whether its length is less than a maximal length M, but this could easily be added to the code. Finally, you may translate the code into any other language; just be certain you implement the same algorithm correctly.

#include <stdio.h>
#include <stdlib.h>   /* exit */
#include <math.h>

#define TRUE (1)
#define FALSE (0)

#define MAX_CITIES (1000)
#define MAX_DIST   (1000)   /* must exceed every inter-city distance in the data */

typedef int bool;
typedef struct location {double x; double y;} loc;

int   number_of_cities;
loc PtArr[MAX_CITIES];

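int main(void)
{
    int i;
    void NearNeighborTour(void);

    /* read and validate the number of cities */
    if (scanf("%d\n", &number_of_cities) != 1 || number_of_cities < 1) {
        fprintf(stderr, "error: bad number of cities\n");
        exit(1);
    }
    if (number_of_cities > MAX_CITIES) {
        fprintf(stderr, "error: too many cities\n");
        exit(1);
    }
    /* read the (x, y) coordinates of each city */
    for (i = 0; i < number_of_cities; i++) {
         if (scanf("%lf %lf\n", &PtArr[i].x, &PtArr[i].y) != 2) {
             fprintf(stderr, "error: bad coordinates for city %d\n", i);
             exit(1);
         }
    }
    NearNeighborTour();
    return 0;
}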

void NearNeighborTour()
{
    int i, j;
    bool visited[MAX_CITIES];
    int this_city;
    int closest_city;
    double closest_distance;
    double distance(int m, int n);

    /* initialize unvisited cities */
    for (i = 0; i < number_of_cities; i++) {
         visited[i] = FALSE;
    }

    /* choose the last city (index number_of_cities - 1) as starting point */
    this_city = number_of_cities - 1;
    visited[this_city] = TRUE;
    printf("First city is %d\n", this_city);

    /* main loop of nearest neighbor heuristic */
    for (i = 1; i < number_of_cities; i++) {
        /* find nearest unvisited city to this city */
        closest_distance = MAX_DIST;
        for (j = 0; j < number_of_cities; j++) {
             if (!visited[j]) {
                 if (distance(this_city, j) < closest_distance) {
                     closest_distance = distance(this_city, j);
                     closest_city = j;
                 }
             }
        }
        /* report closest city */
        printf("Move from %d to %d\n", this_city, closest_city);
        visited[closest_city] = TRUE;
        this_city = closest_city;
    }

    /* finish tour by returning to start */
    printf("Move from %d to %d\n", this_city, number_of_cities - 1);
}

/* Euclidean distance between cities m and n */
double distance(int m, int n)
{
    double x_squared = (PtArr[m].x-PtArr[n].x)*(PtArr[m].x-PtArr[n].x);
    double y_squared = (PtArr[m].y-PtArr[n].y)*(PtArr[m].y-PtArr[n].y);

    return sqrt(x_squared + y_squared);
}

Use of a Profiler

To determine where to put forth an effort to optimize the code, you should profile it. The gcc compiler supports profiling with prof and gprof (the -p and -pg options, respectively); see the manual pages for gcc.

Hints on Increasing the Code's Efficiency

I'd like to delay giving hints on how to increase the code's efficiency. We can discuss ideas by e-mail (cse5081@cs.fit.edu). There are several ideas that should become obvious to you and a few others that are more obscure.

What to Turn In

You must turn in:

1. the original program listing,
2. a listing for each change you make to the code,
3. a report detailing the timing of the original and changed code, with an explanation of why the changes were made and the speedup obtained by each change,
4. a directory on tuck where your source programs, header files, cities file, etc., can be found.

You should make one improvement at a time, using conditional compilation to exclude old portions of the program and include new portions, for example,

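#define NONE  0
#define FIRST 1
/* select a version when compiling, e.g. with -DIMPROVEMENT=FIRST */
#if IMPROVEMENT == NONE
original code goes here
#elif IMPROVEMENT == FIRST
first improvement goes here
#endif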

Your report can be at most 5 pages of double-spaced 10-point type and will be presented at the Annual Computer Conference on Code Optimization, which will be held on June 3, 1997.



William D. Shoaff
2000-11-13