
So Long And Thanks For All The Fish

So far, we've talked about getting PE working, creating message passing parallel programs, debugging problems and debugging parallel applications. When we get a parallel program running so that it gives us the correct answer, we're done. Right? Not necessarily. In this area parallel programs are just like sequential programs; just because they give you the correct answer doesn't mean they're doing it in the most efficient manner. For a program that's relatively short running or is run infrequently, it may not matter how efficient it is. But for a program that consumes a significant portion of your system resources, you need to make the best use of those resources by tuning its performance.

In general, performance tuning can be a complicated task. The Hitchhiker's Guide to the Galaxy tells us that dolphins were the most intelligent life form on the Earth. When the Earth was being destroyed to make way for an Interstellar Bypass, it seems fitting that the dolphins were the only species to leave the planet. Their parting words, to a few select humans, were "So long and thanks for all the fish". The people who tune the performance of applications regard the original developers of their code in much the same way (even if they, themselves, were the original developers); they appreciate the working code, but now they've got more complex things to do, which are significantly different from code development.


Tuning the Performance of a Parallel Application

There are two approaches to tuning the performance of a parallel application:

* Tune the sequential program first, and then parallelize it.
* Parallelize the program first, and then tune the resulting parallel program.

Note:It may not be possible to use some tools in a parallel environment in the same way that they're used in a sequential environment. This may be because the tool requires root authority and POE restricts the root ID from running parallel jobs. Or, it may be because, when the tool is run in parallel, each task attempts to write into the same files, thus corrupting the data. tprof is an example of a tool that falls into both of these categories.

A report from Cornell University found that the two approaches yield comparable performance results. The difference was in the tools that were used in each approach, and how they were used.

With either approach, you use the standard sequential tools in the traditional manner. When an application is tuned and then parallelized, you need to observe the communication performance, how it affects the performance of each of the individual tasks, and how the tasks affect each other. For example, does one task spend a lot of time waiting for messages from another? If so, perhaps the workload needs to be rebalanced. Or if a task starts waiting for a message long before it arrives, perhaps it could do more algorithmic processing before waiting for the message. When an application is parallelized and then tuned, you need a way to collect the performance data in a manner that includes both communication and algorithmic information. That way, if the performance of a task needs to be improved, you can decide between tuning the algorithm or tuning the communication.

This section will not deal with standard algorithmic tuning techniques. Rather, we will discuss some of the ways the PE can help you tune the parallel nature of your application, regardless of the approach you take. To illustrate this, we'll use two examples.


How Much Communication is Enough?

A significant factor that affects the performance of a parallel application is the balance between communication and workload. In some cases, the workload is unevenly distributed or is duplicated across multiple tasks. Ideally, you'd like perfect balance among the tasks but doing so may require additional communication that actually makes the performance worse. We discussed this briefly in "Duplication Versus Redundancy" when we said that sometimes it's better to have all the tasks do the same thing rather than have one do it and try to send the results to the rest.
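As a small illustration of that trade-off (our sketch, not code from the guide), here is the choice in MPI terms: either one task computes a value and broadcasts it, or every task computes it redundantly and no message is sent at all. The second form is exactly what the optimized inversion program later in this section does with the determinant. The helper expensive_value() is a hypothetical stand-in for whatever every task needs.

#include <mpi.h>

/* Hypothetical stand-in for a value every task needs. */
extern double expensive_value(void);

/* Option 1: one task computes the value and broadcasts it to the others. */
double compute_and_send(int rank)
{
    double v = 0.0;
    if (rank == 0)
        v = expensive_value();
    MPI_Bcast(&v, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    return v;
}

/* Option 2: every task computes the value itself.  The arithmetic is     */
/* duplicated, but no message is sent; for cheap computations this is     */
/* often the faster choice.                                               */
double compute_everywhere(void)
{
    return expensive_value();
}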

An example of where the decision is not so clear cut is the matrix inversion program in "The Answer is 42". We showed you how to start making the sequential program into a parallel one by distributing the element calculation once the determinant was found. What we didn't show you was how poor a start this actually is. Part of the program is shown below. You can access the complete program from the IBM RS/6000 World Wide Web site. See "Getting the Books and the Examples Online" for more information.

/*************************************************************************
*
* Matrix Inversion Program - First parallel implementation
*
* To compile:
* mpcc -g -o inverse_parallel inverse_parallel.c
*
*************************************************************************/
     {
/* There are only 2 unused rows/columns left */
 
/* Find the second unused row */
for(row2=row1+1;row2<size;row2++)
  {
    for(k=0;k<depth;k++)
      {
        if(row2==used_rows[k]) break;
      }
    if(k>=depth)  /* this row is not used */
      break;
  }
assert(row2<size);
 
/* Find the first unused column */
for(col1=0;col1<size;col1++)
  {
    for(k=0;k<depth;k++)
      {
         if(col1==used_cols[k]) break;
      }
    if(k>=depth)  /* this column is not used */
      break;
  }
assert(col1<size);
 
/* Find the second unused column */
for(col2=col1+1;col2<size;col2++)
  {
    for(k=0;k<depth;k++)
      {
        if(col2==used_cols[k]) break;
      }
    if(k>=depth)  /* this column is not used */
      break;
  }
assert(col2<size);
 
/* Determinant = m11*m22-m12*m21 */
return matrix[row1][col1]*matrix[row2][col2]
      -matrix[row1][col2]*matrix[row2][col1];
        }
 
        /* There are more than 2 rows/columns in the matrix being processed  */
        /* Compute the determinant as the sum of the product of each element */
        /* in the first row and the determinant of the matrix with its row   */
        /* and column removed                                                */
        total = 0;
 
        used_rows[depth] = row1;
        for(col1=0;col1<size;col1++)
          {
            for(k=0;k<depth;k++)
              {
                if(col1==used_cols[k]) break;
              }
             if(k<depth)  /* This column is used -- skip it*/
               continue;
            used_cols[depth] = col1;
            total += sign*matrix[row1][col1]*determinant(matrix,size,used_rows,
            used_cols,depth+1);
            sign=(sign==1)?-1:1;
          }
        return total;
 
  }
 
void print_matrix(FILE * fptr,float ** mat,int rows, int cols)
{
  int i,j;
  for(i=0;i<rows;i++)
    {
      for(j=0;j<cols;j++)
        {
          fprintf(fptr,"%10.4f ",mat[i][j]);
        }
      fprintf(fptr,"\n");
    }
}
 
float coefficient(float **matrix,int size, int row, int col)
{
  float coef;
  int * ur, *uc;
 
  ur = malloc(size*sizeof(*ur));
  uc = malloc(size*sizeof(*uc));
  ur[0]=row;
  uc[0]=col;
  coef = (((row+col)%2)?-1:1)*determinant(matrix,size,ur,uc,1);
  return coef;
}

To illustrate the problem, run the parallel matrix inversion program, inverse_parallel.c with tracing turned on, and then run VT on the resulting trace file.

$ mpcc -g -o inverse_parallel inverse_parallel.c
$ inverse_parallel -tracelevel 9 -procs 9
$ vt -tracefile inverse_parallel.trc

Select the Interprocessor Communication, Source Code, Message Matrix and the System Summary displays and begin playing the trace file. You'll see a significant amount of CPU activity, both kernel and user, with very little idle or wait time on the System Summary display, but no communication activity for a while. Finally, you'll see messages sent back to the last task as each server sends its results back. So what's going on? Because VT only records source code location for message library calls, we can't use it to locate where the time is being spent. However, we can go back to gprof.

Recompile inverse_parallel.c with the -pg flag and re-run it. Run gprof on the monitor files for tasks 0-7 (we know task 8 just collects the results so we aren't that concerned with its performance).

$ mpcc -g -pg -o inverse_parallel inverse_parallel.c
$ inverse_parallel -tracelevel 9 -procs 9
$ gprof inverse_parallel gmon.out[0-7]
 
%   cumulative   self              self     total
time   seconds  seconds     calls  ms/call  ms/call  name
80.3       9.81     9.81       72   136.25   136.25  .determinant [1]
8.8      10.89     1.08                             .__mcount [5]
3.6      11.33     0.44                             .kickpipes [6]
0.9      11.44     0.11                             .readsocket [7]
0.0      12.21     0.00       64     0.00   136.25  .coefficient [4]
0.0      12.21     0.00        8     0.00  1226.25  .main [2]

We see that we spend a lot of time in determinant, first to compute the determinant for the entire matrix and then in computing the determinant as part of computing the element values. That seems like a good place to start optimizing.

This algorithm computes the determinant of a matrix by using the determinants of the submatrices formed by eliminating the first row and a column from the matrix. The result of this recursion is that, eventually, the algorithm computes the determinants of all the 2 by 2 matrices formed from the last two rows and each combination of columns. This isn't so bad, but each such 2 by 2 matrix is recomputed n-2 times (once for each column except the two from which it is formed) every time a determinant is computed, and there are n*(n-1)/2 such matrices. If the 2 by 2 determinants could be captured and re-used, it would provide some improvement.
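To put rough numbers on this (our arithmetic, for illustration only): for a 10 by 10 matrix there are 10*9/2 = 45 distinct 2 by 2 matrices that can be formed from the last two rows, and each one is recomputed 10-2 = 8 times, so roughly 360 identical 2 by 2 determinant evaluations collapse to 45 if the results are cached.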

Not only is this a good approach for optimizing a sequential program, but parallelism capitalizes on this approach as well. Because the 2 by 2 determinants are independent, they can be computed in parallel and distributed among the tasks. Each task can take one of the columns and compute the determinants for all the matrices formed by that column and subsequent columns. Then the determinants can be distributed among all the tasks and used to compute the inverse elements.

The following example shows only the important parts of the program. You can access the complete program from the IBM RS/6000 World Wide Web site. See "Getting the Books and the Examples Online" for more information.

Here's the call to partial_determinant:

/************************************************************************
*
* Matrix Inversion Program - First optimized parallel version
*
* To compile:
* mpcc -g -o inverse_parallel_fast inverse_parallel_fast.c
*
************************************************************************/
 
  /* Compute determinant of last two rows */
  pd = partial_determinant(matrix,rows);
  /* Everyone computes the determinant (to avoid message transmission) */
  determ=determinant(matrix,rows,used_rows,used_cols,0,pd);

And here's the partial_determinant function itself:

 
}
 
/* Compute the determinants of all 2x2 matrices created by combinations */
/* of columns of the bottom 2 rows                                      */
/* partial_determinant[i] points to the first determinant of all the 2x2*/
/* matrices formed by combinations with column i.  There are n-i-1      */
/* such matrices (duplicates are eliminated)                            */
float **partial_determinant(float **matrix,int size)
{
  int col1, col2, row1=(size-2), row2=(size-1);
  int i,j,k;
  int terms=0;
  float **partial_det,  /* pointers into the 2x2 determinants*/
                        /* by column                         */
        *buffer,        /* the 2x2 determinants              */
        *my_row;        /* the determinants computed by this */
                        /* task                              */
  int * recv_counts, * recv_displacements; /* the size and offsets for the */
                                           /* determinants to be received from  */
                                           /* the other tasks        */
 
  terms = (size-1)*(size)/2;  /* number of combinations of columns */
 
  /* Allocate work areas for partial determinants and message passing. */
  partial_det = (float **) malloc((size-1)*sizeof(*partial_det));
  buffer      = (float *)  malloc(terms*sizeof(*buffer));
  my_row      = (float *)  malloc((size-me-1)*sizeof(*my_row));
  recv_counts = (int *)    malloc(tasks*sizeof(*recv_counts));
  recv_displacements = (int *) malloc(tasks*sizeof(*recv_displacements));
 
  /* the tasks after the column size - 2 don't have to do anything */
  for(i=tasks-1;i>size-2;i--)
    {
        recv_counts[i]=0;
        recv_displacements[i]=terms;
    }
  /* all the other tasks compute the determinants for combinations */
  /* with its column                                               */
  terms--;
  for(i=size-2;i>=0;i--)
    {
        partial_det[i]=&(buffer[terms]);
        recv_displacements[i]=terms;
        recv_counts[i]=size-i-1;
        terms-=(size-i);
    }
  for(j=0;j<(size-me-1);j++)
    {
      my_row[j]=matrix[row1][me]*matrix[row2][me+j+1]
      -matrix[row1][me+j+1]*matrix[row2][me];
    }
 
  /* Now everybody sends their columns determinants to everybody else */
  /* Even the tasks that did not compute determinants will get the    */
  /* results from everyone else (doesn't sound fair, does it?)        */
  MPI_Allgatherv(my_row,
                 ((size-me-1)>0)?(size-me-1):0,
                 MPI_REAL,
                 buffer,
                 recv_counts,
                 recv_displacements,
                 MPI_REAL,MPI_COMM_WORLD);
 
  /* Free up the work area and return the array of pointers into the */
  /* determinants                                                    */
  free(my_row);
  return partial_det;
}

Now when we look at the VT trace for this version of the program, we can see the initial communication that occurs as the tasks cooperate in computing the partial determinants. The question is whether the cost of the additional communication offsets the advantage of computing the 2 by 2 determinants in parallel. In this example, it may not, because of the small message sizes (the largest is three times the size of a float). As the matrix size increases, the cost of computing the 2 by 2 determinants will increase with the square of n (the size of the matrix), but the cost of computing the determinants in parallel will increase with n (each additional dimension increases the work of each parallel task by only one additional 2 by 2 matrix), so, eventually, the parallel benefit will offset the communication cost.
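To restate that scaling argument with the quantities used in the code (our summary): the total number of 2 by 2 determinants is terms = (size-1)*size/2, which grows with the square of the matrix size, while the task that owns column i computes only size-i-1 of them, which grows only linearly. For example, going from a 100 by 100 matrix to a 200 by 200 matrix raises the total count from 4950 to 19900 (roughly four times), but raises any single task's share from at most 99 to at most 199 (roughly double).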


Tuning the Performance of Threaded Programs

There are some things you need to consider when you want to get the maximum performance out of your program.

A program that has only one MPI communication thread may set the environment variable MP_SINGLE_THREAD=YES before calling MPI_Init. This will avoid some locking which is otherwise required to maintain consistent internal MPI state. The program may have other threads that do computation or other work, as long as they do not make MPI calls.
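For example, here's a minimal sketch (ours, not from the guide) of a program that sets MP_SINGLE_THREAD before calling MPI_Init; exporting the variable from the shell before invoking POE has the same effect.

#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    /* Only this thread will make MPI calls, so let the library skip the  */
    /* locking it would otherwise need to keep its internal state safe.   */
    static char single_thread[] = "MP_SINGLE_THREAD=YES";
    putenv(single_thread);

    MPI_Init(&argc, &argv);

    /* ... computation here; other threads may run, as long as they make  */
    /* no MPI calls ...                                                    */

    MPI_Finalize();
    return 0;
}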

The MP_EUIDEVELOP environment variable lets you control how much checking is done when you run your program. Eliminating checking altogether (setting MP_EUIDEVELOP to NOC) provides performance (latency) benefits, but may cause critical information to be unavailable if your executable hangs due to message passing errors. For more information on MP_EUIDEVELOP and other POE environment variables, see IBM Parallel Environment for AIX: Operation and Use, Vol. 1.
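For example (following the same shell conventions as the other examples in this chapter; the program name myprog is hypothetical), you might turn checking off for a production run like this:

$ export MP_EUIDEVELOP=NOC
$ poe myprog -procs 4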

For more information on profiling, see IBM Parallel Environment for AIX: Operation and Use, Vol. 2.

For more information on MPI tracing and visualization using VT, see IBM Parallel Environment for AIX: Operation and Use, Vol. 2. Support is included for thread safe tracing within the threaded MPI library and limited visualization using VT.


Why is this so slow?

So, you've got a serial program and you want it to execute faster. In this situation, it's best not to jump into parallelizing your program right away. Instead, you start by tuning your serial algorithm.

The program we'll use in this next example approximates the two-dimensional Laplace equation and, in our case, uses a 4-point stencil.

Our algorithm is very straight-forward; for each array element, we'll assign that element the average of the four elements that are adjacent to it (except the rows and columns that represent the boundary conditions of the problem).
Note:You may find it helpful to refer to In Search of Clusters by Gregory Pfister for more information on this problem and how to parallelize it. See "Related Non-IBM Publications".

Note that the 4-point stencil program is central to this entire section, so you may want to spend some time to understand how it works.
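In the notation of the code we'll see shortly, the update applied to each interior cell is:

stencil(i,j) = ( stencil(i-1,j) + stencil(i+1,j) + stencil(i,j-1) + stencil(i,j+1) ) / 4

and the sweep over the array is repeated until the largest change made to any cell in a pass falls below the convergence threshold (close_enough in the program).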

The first step is to compile our serial program. However, before you do this, be sure you have a copy of stencil.dat in your program directory, or run the init program to generate one. Once we've done this, we can compile our serial program with the xlf command:

$ xlf -O2 naive.f -o naive

Next, we need to run the program and collect some information to see how it performs. You can use the UNIX time command to do this:

$ time naive

The following table shows the result:
Program Name   Tasks                  Wallclock Time   Array Size per Task
naive          1 (single processor)   11m1.94s         1000x1000
Note:The figures in the table above, as well as the others in this section, provide results that were gathered on an IBM RS/6000 SP. Your execution time may vary, depending on the system you're using.

Looking at these results, we can see that there's some room for improvement, especially if we scale the problem to a much larger array. So, how can we improve the performance?

Profile it

The first step in tuning our program is to find the areas within it that execute most of the work. Locating these compute-intensive areas within our program lets us focus on the areas that give us the most benefit from tuning. How do we find these areas? The best way to find them is to profile your program.

When we profile our program, we need to compile it with the -pg flag to generate profiling data, like this:

$ xlf -pg -O2 naive.f -o naive

The -pg flag compiles and links the executable so that when we run the program, the performance data gets written to an output file (gmon.out).

Now that we've compiled our program with the -pg flag, let's run it again to see what we get:

$ naive

This generates a file called gmon.out in the current working directory. We can look at the contents of gmon.out with the Xprofiler profiling tool. To start Xprofiler, we'll use the xprofiler command, like this:

$ xprofiler naive gmon.out

The Xprofiler main window appears, and in this window you'll see the function call tree. The function call tree is a graphical representation of the functions within your application and their inter-relationships. Each function is represented by a green, solid-filled box called a function box. In simple terms, the larger this box, the greater the percentage of the total running time it consumes. So, the largest box represents the function doing the most work. The calls between functions are represented by blue arrows drawn between them, called call arcs. The arrowhead of the call arc points to the function that is being called. The function boxes and call arcs that belong to each library in your application appear within a fenced-in area called a cluster box. For the purposes of this section, we'll remove the cluster boxes from the display.

For more information on Xprofiler, see IBM Parallel Environment for AIX: Operation and Use, Vol. 2.

PLACE
the mouse cursor over the Filter menu.

CLICK
the left mouse button

* The Filter menu appears.

SELECT
the Remove All Library Calls option.

* The library calls disappear from the function call tree.

PLACE
the mouse cursor over the Filter menu.

CLICK
the left mouse button.

* The Filter menu appears.

SELECT
the Uncluster Functions option.

* The functions expand to fill the screen.

Locate the largest function box in the function call tree. We can get the name of the function by looking a little more closely at it:

PLACE
the mouse cursor over the View menu.

CLICK
the left mouse button.

* The View menu appears.

PLACE
the mouse cursor over the Overview option.

CLICK
the left mouse button.

* The Overview Window appears.

Figure 11. The Overview Window

The Overview Window includes a light blue highlight area that lets you zoom in and out of specific areas of the function call tree. To take a closer look at the largest function of naive:

PLACE
the mouse cursor over the lower left corner of the blue highlight area. You know your cursor is over the corner when the cursor icon changes to a right angle with an arrow pointing into it.

PRESS and HOLD
the left mouse button, and drag it diagonally upward and to the right (toward the center of the sizing box) to shrink the box. When it's about half its original size, release the mouse button.

* The corresponding area of the function call tree, in the main window, appears magnified.

If the largest function wasn't within the highlight area, it didn't get magnified. If this was the case, you'll need to move the highlight area:

PLACE
the mouse cursor over the highlight area.

PRESS and HOLD
the left mouse button.

DRAG
the highlight area, using the mouse, and place it over the largest function. Release the mouse button.

* The largest function appears magnified in the function call tree.

Just below the function is its name, so we can now see that most of the work is being done in the compute_stencil() subroutine, so this is where we should focus our attention.

It's important to note that the programming style you choose can influence your program's performance just as much as the algorithm you use. In some cases, this will be clear by looking at the data you collect when your program executes. In other cases, you will know this from experience. There are many books that cover the subject of code optimization, many of which are extremely complex. See "Related IBM Publications".

As far as we're concerned, the goal here is not to use every optimization trick that's ever been thought up. Instead, we should focus on some basic techniques that can produce the biggest performance boost for the time and effort we spend.

The best way to begin is to look at our use of memory (including hardware data cache) as well as what we're doing in the critical section of our code. So.....let's look at our code:

      iter_count = 0
 100  CONTINUE
         local_err = 0.0
         iter_count = iter_count + 1
 
         DO i=1, m-2
            DO j=1, n-2
               old_value = stencil(i,j)
 
               stencil(i,j) = ( stencil(i-1, j ) +
     1                          stencil(i+1, j ) +
     2                          stencil( i ,j-1) +
     3                          stencil( i ,j+1) ) / 4
 
               local_err = MAX(local_err,ABS(old_value-stencil(i,j)))
            END DO
         END DO
         IF(MOD(iter_count,100).EQ.0)PRINT *, iter_count, local_err
      IF (close_enough.LT.local_err) GOTO 100
      PRINT *, "convergence reached after ", iter_count, " iterations."

By looking at the two DO loops above, we can see that our compute subroutine is traversing our array first across rows, and then down columns. This program must have been written by some alien being from the planet C because Fortran arrays are stored in column major form rather than row major form.

The first improvement we should make is to reorder our loops so that they traverse down columns rather than across rows. This should provide a reasonable performance boost. Note that it's not always possible to change the order of loops; it depends on the data referenced within the loop body. As long as the values used in every loop iteration don't change when the loops are reordered, then it's safe to change their order. In the example we just looked at, it was safe to reorder the loops, so here's what the revised program looks like. Notice that all we did was swap the order of the loops.

 
      DO j=1, n-2
         DO i=1, m-2
            old_value = stencil(i,j)
 
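To see why reordering isn't always safe, here's a small illustration of ours (hypothetical, not from the stencil program). In this loop nest, swapping the loops changes the answer: with i outermost, a(i,j) reads a value of a(i-1,j+1) that was already updated on the previous pass over i, while with j outermost that element has not been updated yet.

/* Hypothetical example of a loop nest that cannot be reordered safely.   */
/* With i outermost, a[i][j] reads an already-updated a[i-1][j+1]; with   */
/* j outermost, a[i-1][j+1] still holds its original value.               */
void update(int m, int n, float a[m][n])
{
    int i, j;
    for (i = 1; i < m; i++)
        for (j = 0; j < n - 1; j++)
            a[i][j] = a[i-1][j+1] + 1.0f;
}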

The second thing we should look at is the type of work that's being done in our loop. If we look carefully, we'll notice that the MAX and ABS subroutines are called in each iteration of the loop, so we should make sure these subroutines are compiled inline. Because these subroutines are intrinsic to our Fortran compiler, this is already done for us.

$ xlf -O2 reordered.f -o reordered

As before, we need to time our run, like this:

$ time reordered

And here are the results as compared to the original naive version:
Program Name   Tasks                  Wallclock Time   Array Size per Task
naive          1 (single processor)   11m1.94s         1000x1000
reordered      1 (single processor)   5m35.38s         1000x1000

As you can see by the results, with just a small amount of analysis, we doubled performance. And we haven't even considered parallelism yet. However, this still isn't the performance that we want, especially for very large arrays (the CPU time is good, but the elapsed time is not).

Parallelize it

Now that we feel confident that our serial program is reasonably efficient, we should look at ways to parallelize it. As we discussed in "The Answer is 42", there are many ways to parallelize a program, but the two most commonly used techniques are functional decomposition and data decomposition. We'll focus on data decomposition only.

OK, so how do I decompose my data? Let's start by dividing the work across the processors. Each task will compute a section of the array, so each program solves 1/n of the problem when using n processors.
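As a concrete sketch (ours, with hypothetical names; the guide's actual decomposition lives in chaotic.f), here is how a task might derive its own block of the global array from its task id on a 2 x 2 task grid, which is how a 1000x1000 problem becomes four 500x500 blocks:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, tasks, prow, pcol, rows, cols, row0, col0;
    const int N = 1000;               /* global array is N x N           */
    const int PROWS = 2, PCOLS = 2;   /* hypothetical 2 x 2 task grid     */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &tasks);

    prow = rank / PCOLS;              /* this task's row in the grid      */
    pcol = rank % PCOLS;              /* this task's column in the grid   */
    rows = N / PROWS;                 /* 500 local rows                   */
    cols = N / PCOLS;                 /* 500 local columns                */
    row0 = prow * rows;               /* first global row owned here      */
    col0 = pcol * cols;               /* first global column owned here   */

    printf("task %d of %d owns rows %d-%d, columns %d-%d\n",
           rank, tasks, row0, row0 + rows - 1, col0, col0 + cols - 1);

    MPI_Finalize();
    return 0;
}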

Here's the algorithm we'll use on each iteration of each task:

1. Exchange boundary cells with the neighboring tasks.
2. Average each interior cell over its four neighbors, just as before.
3. Track the largest change made to any local cell.
4. Use MPI_Allreduce to find the largest change made anywhere (the global error).
5. Stop when the global error is within close_enough; otherwise repeat.

The section of code for our algorithm looks like this:

      iter_count = 0
 100  CONTINUE
         local_err = 0.0
         iter_count = iter_count + 1
         CALL exchange(stencil, m, n)
 
         DO j=1, n-2
            DO i=1, m-2
               old_value = stencil(i,j)
 
               stencil(i,j) = ( stencil(i-1, j ) +
     1                          stencil(i+1, j ) +
     2                          stencil( i ,j-1) +
     3                          stencil( i ,j+1) ) / 4
 
               local_err = MAX(local_err,ABS(old_value-stencil(i,j)))
           END DO
        END DO
        CALL MPI_Allreduce(local_err, global_error, 1, MPI_Real,
     1      MPI_Max, MPI_Comm_world, ierror)
 
        IF(MOD(iter_count,100).EQ.0)PRINT *, iter_count, global_error
      IF (close_enough.LT.global_error) GOTO 100
      PRINT *, "convergence reached after", iter_count, "iterations."

Now, let's compile our parallelized version:

$ mpxlf -O2 chaotic.f -o chaotic

Next, let's run it and look at the results:

$ export MP_PROCS=4
$ export MP_LABELIO=yes
$ time poe chaotic

Program Name   Tasks                  Wallclock Time   Array Size per Task
naive          1 (single processor)   11m1.94s         1000x1000
reordered      1 (single processor)   5m35.38s         1000x1000
chaotic        4 (processors)         2m4.58s          500x500

The results above show that we more than doubled performance by parallelizing our program. So...we're done, right? Wrong! Since we divided up the work between four processors, we expected our program to execute four times faster. Why doesn't it? This could be due to one of several factors that tend to influence overall performance.

We'll look at some of these factors later. Right now we need to be asking ourselves something more important; does the parallel program get the same answer?

The algorithm we chose gives us a correct answer, but as you will see, it doesn't give us the same answer as our serial version. In practical applications, this may be OK. In fact, it's very common for this to be acceptable in Gauss/Seidel chaotic relaxation. But what if it's not OK? How can we tell? What methods or tools can be used to help us diagnose the problem and find a solution?

Let's take a look...

Wrong answer!

We've now invested all this time and energy in parallelizing our program using message passing, so why can't we get the same answer as the serial version of the program? This is a problem that many people encounter when parallelizing applications from serial code and can be the result of algorithmic differences, program defects, or environment changes.

Both the serial and parallel versions of our program give correct answers based on the problem description, but that doesn't mean they compute identical answers! Let's examine the problem more closely by running the chaotic.f program under the pedb debugger:

$ pedb chaotic

Figure 12. pedb main window


By looking at the main program, we can see that both versions of our program (reordered.f and chaotic.f) read in the same data file as input. And after we initialize our parallel environment, we can see that the compute_stencil subroutine performs exactly the same steps in order to average the stencil cells.

Let's run each version under the control of the debugger to view and compare the results of our arrays.

With this test, we will be looking at the upper left quadrant of the entire array. This allows us to compare the array subset on task 0 of the parallel version with the same subset on the serial version.

Here's the serial (reordered) array and parallel (chaotic) array stencils:

Figure 13. Serial and Parallel Array Stencils


In chaotic.f, set a breakpoint at the call to compute_stencil at line 168.

PLACE
the mouse cursor on line 168.

DOUBLE CLICK
the left mouse button.

* The debugger sets a breakpoint at line 168. A stop marker (drawn to look like a stop sign containing an exclamation point) appears next to the line.

Figure 14. Setting Breakpoint


Note:After you do this, all tasks should have a breakpoint set at line 168.

Continue to execute the program up to the breakpoints. The program counter should now be positioned at line 168.

Next, we'll need to examine the array stencil. We can do this by exporting the array to a file to compare answers.

First, select the variable stencil on task 0.

PLACE
the mouse cursor over the stencil variable.

CLICK
the left mouse button to highlight the selection.

Figure 15. Variable List


Bring up the Variable Options menu.

CLICK
the right mouse button.

* The Variable Options menu appears.

Figure 16. Variable Options Menu


SELECT
Export to File

* The Export window appears.

Figure 17. Export Window


We'll be exporting the subrange 0 through 500 in both dimensions.

Next, rename the file to be created to chaotic.hdf so that we'll be able to differentiate our results from the two runs. Then, press the Export button to export the array to an HDF file. Wait for the export to complete and then exit the debugger.

As you can see, we now have a file called chaotic.hdf in our current working directory.

Remember that our objective is to compare each of the results, so we should set our number of processes to 1 and then run the reordered program in the same way as we did with chaotic.
Note:Remember to only export array indices 0 through 500 in each dimension to get a valid comparison. Also, remember to rename the output file to reorder.hdf.

And now for something completely different...let's take a look at our HDF files using IBM Visualization Data Explorer (this is a separate program product, available from IBM). Here's what we see:

Figure 18. Visualizing Reordered HDF File


Figure 19. Visualizing Chaotic HDF File


Well, they look the same, but how do we really know? To find out, we can look at the reordered and chaotic raw data. They appear to be the same, but if we look more closely, we can see some differences. Let's take a look at a section of row 2 in the raw data of each.

Here's the reordered data:

(row, col)
 (2,310)  (2,311)  (2,312)  (2,313)  (2,314)  (2,315)
9.481421 9.440039 9.398028 9.355416 9.312231 9.268500
 
 (2,316)  (2,317)  (2,318)  (2,319)  (2,320)  (2,321)
9.224252 9.179513 9.134314 9.088681 9.042644 8.996230

Here's the chaotic data:

(row, col)
 (2,310)  (2,311)  (2,312)  (2,313)  (2,314)  (2,315)
9.481421 9.440039 9.398028 9.355416 9.312232 9.268501
 
 (2,316)  (2,317)  (2,318)  (2,319)  (2,320)  (2,321)
9.224253 9.179514 9.134315 9.088682 9.042645 8.996231

After looking at the raw data, we see that our answers are definitely similar, but different. Why? We can blame it on a couple of things, but it's mostly due to the chaotic nature of our algorithm. If we look at how the average is being computed in the serial version of our program, we can see that within each iteration of our loop, two array cells are from the old iteration and two are from new ones.

Figure 20. How the average is computed in a 4-point stencil

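Spelled out for the serial (reordered) sweep order (our annotation of the figure above): when stencil(i,j) is updated, its north and west neighbors have already been updated in the current iteration, while its east and south neighbors still hold the previous iteration's values:

stencil(i,j) = ( stencil(i-1,j)[new] + stencil(i,j-1)[new] + stencil(i+1,j)[old] + stencil(i,j+1)[old] ) / 4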

Another factor is that the north and west borders contain old values at the beginning of each new sweep for all tasks except the northwest corner. The serial version would use new values in each of those quadrants instead of old values. In the parallel version of our program, this is true for the interior array cells but not for our shared boundaries. For more information, you may find In Search of Clusters by Gregory F. Pfister helpful. See "Related Non-IBM Publications".

OK, now that we know why we get different answers, is there a fix?

Here's the Fix!

So, you've got a serial and parallel program that don't give you the same answers. One way to fix this is to skew the processing of the global array. We skew the processing of the array, computing the upper left process coordinate first, then each successive diagonal to the lower right process coordinate. Each process sends the east and south boundary to its neighboring task.

Figure 21. Sequence of Array Calculation


The only thing we need to modify in our new program is the message passing sequence. Prior to the compute_stencil() subroutine, each task receives boundary cells from its north and west neighbors. Each task then sends its east and south boundary cells to its neighbor. This guarantees that the array cells are averaged in the same order as in our serial version.
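Here is a minimal sketch of that receive-compute-send sequence (ours, in C with hypothetical argument names; the guide's actual routines are the Fortran exch_in and exch_out called in skewed.f below):

#include <mpi.h>

/* Receive our north boundary row and west boundary column before         */
/* computing.  north and west are neighbor task ids, or MPI_PROC_NULL on  */
/* an edge of the task grid, in which case the receive is a no-op.        */
void receive_boundaries(float *north_row, float *west_col, int n,
                        int north, int west)
{
    MPI_Status status;
    MPI_Recv(north_row, n, MPI_FLOAT, north, 0, MPI_COMM_WORLD, &status);
    MPI_Recv(west_col,  n, MPI_FLOAT, west,  1, MPI_COMM_WORLD, &status);
}

/* After computing, send our south row and east column so the tasks to    */
/* the south and east can begin their own sweeps.                         */
void send_boundaries(float *south_row, float *east_col, int n,
                     int south, int east)
{
    MPI_Send(south_row, n, MPI_FLOAT, south, 0, MPI_COMM_WORLD);
    MPI_Send(east_col,  n, MPI_FLOAT, east,  1, MPI_COMM_WORLD);
}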

Here's our modified (skewed) parallel program. It's called skewed.f.

      iter_count = 0
 100  CONTINUE
         local_err = 0.0
         iter_count = iter_count + 1
         CALL exch_in(stencil, m, n)
 
         DO j=1, n-2
            DO i=1, m-2
               old_value = stencil(i,j)
 
               stencil(i,j) = ( stencil(i-1, j ) +
     1                          stencil(i+1, j ) +
     2                          stencil( i ,j-1) +
     3                          stencil( i ,j+1) ) / 4
 
               local_err = MAX(local_err,ABS(old_value-stencil(i,j)))
            END DO
         END DO
 
         CALL exch_out(stencil, m, n)
         CALL MPI_Allreduce(local_err, global_error, 1, MPI_Real,
     1      MPI_Max, MPI_Comm_world, ierror)
 
         IF(MOD(iter_count,100).EQ.0)PRINT *, iter_count, global_error
      IF (close_enough.LT.global_error) GOTO 100
      PRINT *, "convergence reached after", iter_count, "iterations."

Now let's run this new version and look at the results:

$ time poe skewed

Program Name   Tasks                  Wallclock Time   Array Size per Task
naive          1 (single processor)   11m1.94s         1000x1000
reordered      1 (single processor)   5m35.38s         1000x1000
chaotic        4 (processors)         2m4.58s          500x500
skewed         4 (processors)         4m41.87s         500x500

If we do the same array comparison again we can see that we do indeed get the same results. But, of course, nothing's that easy. By correcting the differences in answers, we slowed down execution significantly, so the hidden cost here is time. Hmm... now what do we do?

It's Still Not Fast Enough!

We've got the right answers now, but we still want our program to move faster, so we're not done yet. Let's look at our new code to see what other techniques we can use to speed up execution.

Let's use VT to see how our program executes:

$ skewed -tracelevel 3
Note:We use trace level 3 to look at message passing traits.

$ vt -tracefile skewed.trc

Figure 22. VT Space Time Diagram


By looking at the message passing, we can see some peculiar characteristics of our program. For instance, notice that many of the processors waste time by waiting for others to complete before they continue. These kinds of characteristics lead us to the conclusion that we've introduced very poor load balancing across tasks. One way to alleviate this problem is to allow some processors to work ahead if they can deduce that another iteration will be necessary in order to find a solution.

If a task's individual max is large enough on one iteration to force the global max to reiterate across the entire array, that task may continue on the next iteration when its west and north boundaries are received.

To illustrate this, we'll use the pipelined.f program.

      iter_count = 0
      local_err = close_enough + 1
 100  CONTINUE
         iter_count = iter_count + 1
         CALL exch_in(stencil, m, n, local_err, global_err,
     1      iter_count, close_enough)
 
         IF (MAX(global_err,local_err).GE.close_enough) THEN
            local_err = 0.0
            DO j=1, n-2
               DO i=1, m-2
                  old_val = stencil(i,j)
 
                  stencil(i,j) = ( stencil( i-1, j ) +
     1                             stencil( i+1, j ) +
     2                             stencil( i ,j-1) +
     3                             stencil( i ,j+1) ) / 4
 
                  local_err = MAX(local_err, ABS(old_val-stencil(i,j)))
               END DO
            END DO
         END IF
 
         CALL exch_out(stencil, m, n, global_err, local_err)
 
         IF(MOD(iter_count,100).EQ.0)PRINT *, iter_count, global_err
     IF (MAX(global_err,local_err).GE.close_enough) GOTO 100
     PRINT *, "convergence reached after", iter_count, "iterations."

As you can see on the line:

IF(MAX(global_err,local_err).GE.close_enough) THEN

the program checks to see if the value of local_err is enough to allow this task to continue on the next iteration.

Now that we've made these improvements to our program, we'll most likely see improvement in our load balance as well.

Let's see how we affected our load balance by running our new code with tracing turned on.

$ time poe pipelined

Our new code shows the following message passing graph:

Figure 23. VT Displays


Now, let's run our new code to see how this new version fares.

$ time poe pipelined

Program Name   Tasks                  Wallclock Time   Array Size per Task
naive          1 (single processor)   11m1.94s         1000x1000
reordered      1 (single processor)   5m35.38s         1000x1000
chaotic        4 (processors)         2m4.58s          500x500
skewed         4 (processors)         4m41.87s         500x500
pipelined      4 (processors)         2m7.42s          500x500

So, how did we do? The table above shows the final statistics.

As you can see, we were able to significantly improve the performance of our program and, at the same time, get a consistent, correct answer. And all we had when we started was our serial program!


Tuning Summary

As you've probably deduced, tuning the performance of a parallel application is no easier than tuning the performance of a sequential application. If anything, the parallel nature introduces another factor into the tuning equation. The approach the IBM Parallel Environment for AIX has taken toward performance tuning is to provide tools which give you the information necessary to perform the tuning. We've given you the fish. Where you go from there is up to you.

