Profiling to tune C programs on Alpha

Paul J. Drongowski
Hewlett Packard Company
110 Spit Brook Road
Nashua, New Hampshire 03062

6 April 2001
Copyright © 2001 Hewlett Packard Company

1.0 Introduction

Performance tuning is the process of identifying and removing bottlenecks. Bottlenecks can arise from many different sources. Here are but a few examples.

The executable program, as generated by the compiler, may not fully exploit the resources of the base machine, such as parallel computation by multiple arithmetic/logic units.
The program may not exhibit good spatial/temporal locality when accessing memory.
The program may wait on the completion of disk or network input/output activity.
One or more threads in a multi-threaded programs may be waiting unnecessarily due to a synchronization problem or poor message-based communication.

This paper describes the application of the Compaq Continuous Profiling Infrastructure (DCPI/ProfileMe), a powerful tool for the analysis of program execution on Tru64/Alpha. In particular, DCPI/ProfileMe assists the analysis of the CPU and memory system behavior of a program. It does not address input/output or thread communication and synchronization behavior.

1.1 DCPI/ProfileMe is easy to use

DCPI/ProfileMe runs in the background and collects performance information about a program as it executes. In fact, DCPI/ProfileMe collects information about all running processes, giving the programmer a system-wide view. This broad view assists the analysis of applications that may be distributed across two or more processes.

Here is a short example of DCPI/ProfileMe in action. It shows how to run DCPI/ProfileMe and the kind of information produced by the its analysis. We'll run and analyze the following program, which computes the value of e raised to the x power for x between 0.0 and 10.0. This is not a particularly good implementation of the Taylor series approximation, but it should burn about one minutes worth of execution time. (The execution platform is a 667MHz XP-1000 workstation with 256Mbytes of primary memory and a 4Mbyte level 2 cache, running the Tru64 UNIX operating system.) The function main consists of a loop that calls e_to_the_x with successive values of x. The function e_to_the_x evaluates a Taylor series polynomial with 40 terms.

#include <stdio.h>

#define NTERMS              40
#define NITERATIONS  100000000

double e_to_the_x(double x)
{
  long int i ;
  double factorial, term, result ;

  result = 1.0 ;
  factorial = 1.0 ;
  term = 1.0 ;

  for (i = 1 ; i <= NTERMS ; i++) {
    term = term * x ;
    factorial = factorial * (double) i ;
    result = result + (term / factorial) ;
  }

  return( result ) ;
}

main(int argc, char *argv[])
{
  int i ;
  double ex, x, step ;

  x = 0.0 ;
  step = 10.0 / ((double) NITERATIONS) ;
  for (i = 0 ; i <= NITERATIONS ; i++) {
    ex = e_to_the_x(x) ;
    if ((i == 0) || (i == NITERATIONS)) {
      printf("e^%f: %f\n", x, ex) ;
    }
    x = x + step ;
  }
}

The session transcription sets the environment variable DCPIDB to the path to the DCPI measurement database, starts the DCPI/ProfileMe collection system, runs the program (taylor), and stops data collection.

1 > setenv DCPIDB /dsk0h/dcpidb
2 > dcpid -event pm $DCPIDB         
dcpid: load   : loaded pcount driver
dcpid: monitoring pm
dcpid: logging to /dsk0h/dcpidb/dcpid-positron.log
3 > taylor
e^0.000000: 1.000000
e^10.000000: 22026.465709
4 > dcpiquit
dcpictl: quit successful

Collecting data is just that easy. Data collection adds about 2 to 3 percent overhead to program execution. It does not require recompilation/relinking, nor does it invade the program in any way.

DCPI/ProfileMe provides several tools for analyzing and displaying performance information. This paper concentrates on the use of two of these tools, dcpiprof and dcpilist. The basic analytical process involves:

Examining an overview of program performance to find potential trouble spots or "hot," frequently executed parts of the program, and
Narrowing the focus to particular components (images and procedures) within the program.

The dcpiprof transcription below shows an overview of the execution of taylor. DCPI/ProfileMe collects and summarizes hardware events that occur during program execution. The summary below shows the distribution of retired (completed) instructions across the program images that were active during the measurement period. The program image taylor has the largest number of retired instructions indicating that it was the busiest component and is a likely candidate for further investigation.

5 > dcpiprof -i -pm retired
Column                Total  Period (for events)
------                -----  ------
retired:count        815091   63488

The numbers given below are the number of samples for each
listed event type or, for the ratio of two event types, the
ratio of the number of samples for the two event types.
===========================================================
retired                
 :count       %    cum% image
 667207  81.86%  81.86% taylor
 147151  18.05%  99.91% /vmunix
    351   0.04%  99.95% /dsk0h/dcpidb/PALcode
    178   0.02%  99.97% /usr/bin/dcpid
     79   0.01%  99.98% /usr/shlib/libXt.so
     49   0.01%  99.99% /usr/shlib/libc.so
     18   0.00%  99.99% /usr/shlib/libX11.so
     15   0.00%  99.99% /usr/shlib/libpthread.so
     11   0.00% 100.00% /usr/shlib/X11/libdix.so
      8   0.00% 100.00% /usr/shlib/libXm.so
      5   0.00% 100.00% unknown@positron
      4   0.00% 100.00% /usr/shlib/libmach.so
      3   0.00% 100.00% /usr/shlib/X11/libos.so
      2   0.00% 100.00% /usr/shlib/X11/lib_dec_ws.so
      2   0.00% 100.00% /usr/sbin/routed
      2   0.00% 100.00% /usr/sbin/xntpd
      2   0.00% 100.00% /sbin/loader
      1   0.00% 100.00% /bin/sh
      1   0.00% 100.00% /usr/dt/bin/dtterm
      1   0.00% 100.00% /usr/shlib/X11/libmi.so
      1   0.00% 100.00% /usr/bin/csh

The dcpiprof tool is also used to produce a procedure-by-procedure breakdown of DCPI/ProfileMe events for an image. The excerpt below shows the breakdown of retired instructions for each of the functions in taylor. The function e_to_the_x is the most active procedure and is worth a closer look.

6 > dcpiprof -pm retired taylor
Column                Total  Period (for events)
------                -----  ------
retired:count        667207   63488

The numbers given below are the number of samples for each
listed event type or, for the ratio of two event types, the
ratio of the number of samples for the two event types.
===========================================================
retired                
 :count       %    cum% procedure                 image
 646743  96.93%  96.93% e_to_the_x                taylor
  20464   3.07% 100.00% main                      taylor

The dcpilist program shows the instruction-by-instruction breakdown of DCPI/ProfileMe events for a particular procedure. dcpilist displays the breakdown as an annotated assembler (or source language) listing. The annotations show the number of selected DCPI/ProfileMe events associated with each instruction. The transcription below shows the number of retire events and retire delay for each instruction in e_to_the_x. An instruction retires when it successfully executes to completion in the CPU pipeline. An instruction's retire delay is a lower bound on the number of CPU cycles that the instruction has delayed its own retirement and the retirement of other instructions following it in program order. High retire delay may indicate the presence of a performance bottleneck.

7 > dcpilist -pm retired+retired:retdelay e_to_the_x taylor
e_to_the_x:
retired   retired 
 :count :retdelay  freq   
      0         0  1122        0x120001160   ldah    gp, 8192(t12)
      0         0  1122        0x120001164   ldq_u   zero, 0(sp)
      0         0  1122        0x120001168   lda     gp, 28448(gp)
      0         0  1122        0x12000116c   ldq_u   zero, 0(sp)
   1561         0  1122        0x120001170   ldah    at, -8192(gp)
   1604         0  1122        0x120001174   lda     sp, -16(sp)
   1661         5  1122        0x120001178   lds     $f1, -28888(at)
   1587         0  1122        0x12000117c   bis     zero, 0x1, v0
   1589         0  1122        0x120001180   cpys    $f1,$f1,$f0
   1567         0  1122        0x120001184   cpys    $f1,$f1,$f10
   1534         0  1122        0x120001188   ldq_u   zero, 0(sp)
   1526      1526  1122        0x12000118c   ldq_u   zero, 0(sp)
  63089         0 63103        0x120001190 : stq     v0, 0(sp)
  63134         0 63103        0x120001194   mult    $f1,$f16,$f1
  63248         0 63103        0x120001198   lda     v0, 1(v0)
  63105         1 63103        0x12000119c   cmple   v0, 0x28, t0
  62998         0 63103        0x1200011a0   ldt     $f11, 0(sp)
  62537         8 63103        0x1200011a4   cvtqt   $f11,$f11
  62995        30 63103        0x1200011a8   mult    $f10,$f11,$f10
  63413     63442 63103        0x1200011ac   divt    $f1,$f10,$f11
  63415    625262 63103        0x1200011b0   addt    $f0,$f11,$f0
  63097     63125 63103        0x1200011b4   bne     t0, 0x120001190
   1557         0  1122        0x1200011b8   lda     sp, 16(sp)
   1526         4  1122        0x1200011bc   ret     zero, (ra), 1

The relative distribution of event counts and counter values often indicates the presence of a performance bottleneck. In the example above, the high retire delay at address 0x120000900 is due to the relatively high latency of the floating point divide instruction (divt.) The floating point add instruction (addt) must wait for the quotient in register f10.

1.2 Who should read this paper

This paper is for experienced C programmers who would like to tune their programs for execution on Alpha. FORTRAN programmers with a reading knowledge of C will also benefit. The paper concentrates on the ProfileMe features as supported by contemporary Alpha processors like the 21264.

Performance analysis with DCPI/ProfileMe is not for the casual user. A programmer must be willing to explore the relationships between source code and compiled code, and between the program and the behavior of the Alpha processor and its associated memory system. A reading knowledge of Alpha assembler language is helpful. A programmer should at least be able to identify load, store, computational and branch instructions, and to follow the flow of data between instructions via the Alpha integer and floating point registers.

1.3 Roadmap

If you're ready to forge ahead, here is the roadmap for the rest of the paper. Section 2 is a detailed introduction to the behavior of the Alpha 21264 processing pipeline. The information in Section 2 is needed to understand the meaning of certain DCPI/ProfileMe hardware events and may be skipped on a first reading. Section 3 describes the way DCPI/ProfileMe collects performance information. It also contains several tables that summarize the kinds of events and information that DCPI/ProfileMe can collect.

After these preliminaries, Section 4 discusses the "mechanics" of using DCPI/ProfileMe in more detail. Section 4 should get you running DCPI/ProfileMe and experimenting with programs of your own. Section 5 digs into the interpretation of DCPI/ProfileMe data for specific Alpha processor and memory system features. Section 5 covers broad performance indicators, branch prediction behavior, replay traps, memory system behavior and data alignment in memory. The section concludes with a brief description of relatively rare behavioral issues.

2. Alpha 21264 pipeline

The Alpha 21264 processor achieves high performance by identifying and exploiting instruction level parallelism (ILP.) ILP is identified by analyzing dependencies between instructions. ILP is exploited by removing false dependencies and issuing instructions for execution when they are data-ready. Instructions may issue and execute out of program order although results are committed in program order to maintain program and architectural correctness. The 21264 provides several computational units (four integer and two floating point) that operate in parallel to perform computations concurrently when possible.

Let's illustrate this rather dense paragraph with a short example. The following assignment statement:

e ← (a + b) * (c + d)

may be compiled into the sequence of instructions:

addq     $16, $17, $20  # Where register $16=a and $17=b
addq     $18, $19, $21  # Where register $18=c and $19=d
mulq     $20, $21, $22  # Putting the result, e, into register $22

If the values of a, b, c, and d are available in their respective registers (that is, the data is ready), then the two additions could be performed concurrently on two separate functional units, assuming that the units are also available and ready. The multiply depends upon the results produced by the two additions and must be performed after the additions. The 21264 identifies the opportunity to execute the two additions in parallel and assigns the computational resources needed to do the job. It also recognizes that the data dependencies between the add instructions and the multiply instruction must be preserved and performs the multiply after the addition instructions.

Instructions pass through several stages of execution as they find their way through the 21264 pipeline. Alpha operate (register-to-register), load, and store instructions go through five common stages (stages 0 through 4) as shown in the table below. Processing for these main instruction types becomes quite different in subsequent stages (stage number 5 and beyond.)

Stage	Purpose
0	Instruction fetch
1	Instruction slotting
2	Register map
3	Instruction issue
4	Register read
5	Execution
...	Retirement

Here is a brief rundown on the common instruction stages. This is a simplified description and we will add details later when considering performance.

Stage 0: Instruction fetch. Four instructions are fetched from the instruction cache (I-cache) based on a predicted program counter value. A group of four such instructions is called a "quadpack" and is aligned on a 16-byte address boundary.
Stage 1: Instruction slotting. Instruction data is transferred from the I-cache to the integer and floating pointing register map hardware. Instruction slotting logic assigns integer instructions to a subcluster of integer execution (functional) units. The instructions will execute on the assigned subcluster.
Stage 2: Register map. Register specifiers within the instructions are mapped into a large pool of physical registers (80 integer and 72 floating point registers.) Mapping eliminates false dependencies and exposes potential ILP. Each instruction is assigned a unique 8-bit number, called an "inum," to identify the instruction and its program order with respect to other instructions. Mapped instructions and associated inums are put into integer and floating point instruction queues where they are made available to the next stage.
Stage 3: Instruction issue. Separate integer and floating point issue units select the oldest data-ready instructions for execution. An instruction is "older" if it was fetched in program order before another instruction. The integer issue unit selects four data-ready instructions from the 20 locations in the integer instruction queue and issues them to the integer execution units. The floating point issue unit selects two data-ready instructions from the 15 locations in the floating point instruction queue and issues them to the floating point execution units.
Stage 4: Register read. Issued instructions read their operands from the integer and floating point register files and receive bypass data. A bypass is a datapath that feeds data forward without writing it to the register file.
Stage 5: Execution. Begin instruction execution according to instruction type.

All instructions pass through one final stage -- retirement. An instruction is "retired" when the instruction and all preceding instructions (by program order) are past the point at which exceptions and mispredicts are detected. Retirement commits the results of the instruction to the architectural, program state and frees physical registers that are no longer needed. The 21264 can retire up to eleven instructions in a single cycle. Instructions are "in flight" from the time they are mapped to the time that they are retired.

Operate instructions are integer and floating point instructions that perform register-based operations. To achieve high speed, each computational cluster (two integer and one floating point) has its own register file. The two integer register files must be kept up to date and consistent by exchanging results. Operate instructions pass through the following sequence of stages:

Stage 5: Execution. The assigned execution unit computes results. This stage may require multiple clock cycles. Simple arithmetic and logical operations need only one cycle.
Stage 6: Write to register file. The results are written to the register file within the cluster and results are bypassed into the next execution stage within the cluster.
Stage 7: Write to cross-cluster register file. The results are written to the cross-cluster register file and are bypassed into the next execution stage in the other cluster.

Note that cross-cluster data transfers require an extra processor clock cycle to complete.

Due to resource constraints in the integer execution unit, up to two load instructions can be issued per clock cycle. Stages in load processing are:

Stage 5: Compute effective virtual address. Calculate the effective virtual address in the integer execution unit.
Stage 6: Access D-cache. Read data from the data cache.
Stage 7: Transfer D-cache data to execution units. Send the D-cache data to the functional units.
Stage 8: Use data. The data is available for use by a consuming instruction during this stage.

This scenario assumes that the data for the virtual address is found in the data cache (a D-cache hit.) If the load misses in the D-cache, then the sequence of stages is extended to map the virtual address to its physical address and to access data in the level 2 (L2) cache. If the load misses in the L2 cache, then the sequence of stages is further extended to access data in primary memory.

Three components help implement access to L2 cache and primary memory:

The 32-entry Load Queue holds (in program order) in flight load instructions and associated state.
The 32-entry Store Queue holds (in program order) in flight store instructions and associated state. The Store Queue holds an instruction until it is retired. At that point, the D-cache is updated.
The 8-entry Miss Address File (MAF) contains all of the in flight cache transactions that are pending on the L2 cache and primary memory. Each entry in the MAF describes a 64-byte block of data that will ultimately be written to the D-cache or I-cache (D-cache or I-cache "fills.")

The load queue, store queue and MAF let the 21264 execute beyond pending loads and stores without waiting for them to complete. If useful work can be performed by other instructions in the meantime, the memory latency can be "hidden." As many as 32 load operations and 32 store operations can be pending.

Store instructions are executed in the following way.

Stage 5: Compute effective virtual address. Calculate the effective virtual address in the integer execution unit.
Stage 6: Enqueue store instruction. Put the store instruction, address and data in the Store Queue until the store instruction is retired.

When a store instruction is retired, the new data is committed (written) to the D-cache.

Throughput depends upon a fast, smooth flow of instructions and data into, through and out of the processor pipeline. The 21264 incorporates features to increase and smooth the flow of data and instructions. For example, the 21264 fetches instructions aggressively, even by fetching through conditional branches. When a conventional (more precisely, "non-speculative") processor encounters a conditional branch instruction, it awaits the final outcome of the branch condition to see where the flow of control will go next before fetching new instructions. The 21264, however, uses branch prediction to speculate on the control flow through conditional branches and continues to prefetch instructions based upon its prediction.

Resource	Capacity
Integer instruction queue	20 entries
Floating point instruction queue	15 entries
Physical integer registers	80 registers
Physical floating point registers	72 registers
Integer functional units	4 units
Floating point functional units	2 units
Load queue	32 entries
Store queue	32 entries
Miss address file	8 entries
Instruction Translation Buffer	128 entries
Data Translation Buffer	128 entries

Alpha 21264 resource capacities.

Performance is lost when the smooth flow of instructions or data is disturbed. The flow may be disturbed in several ways.

Branch mispredictions. The branch predictor may not predict the direction of a conditional branch correctly. Any instructions that were speculatively fetched and executed must be discarded, the processor must roll back to the architectural state at the time of the branch, and instructions must be fetched from the correct target address.
Replay traps. Replay traps preserve program/architectural producer-consumer dependencies between load and store instructions. A replay trap is an internal processor mechanism and is not the same as a software interrupt. A replay is triggered when a load or store instruction cannot be executed due to a condition that occurs after that instruction has issued. The instruction, and all newer instructions, are aborted. The instruction is restarted at the fetch stage of the pipeline.
Translation buffer misses. The Data Translation Buffer (DTB) and Instruction Translation Buffer (ITB) provide quick look-aside buffers for virtual-to-physical address translation information. A fetch, load or store misses when the translation buffer does not have information for the target virtual memory page. The operating system must be called to put the needed information into the translation buffer.
Data or instruction not in cache. Data and instructions in the level 1 (L1) cache are quickly accessible (3 cycle latency for data.) The time to access data and instructions increases as access moves further way from the processor to higher levels of the memory hierarchy. Page faults are the worst costing many millions of lost cycles to satisfy a demand.
Unaligned load or store. When a quadword or longword load/store attempts a memory access at an unaligned address, the processor traps to the operating system. The operating system fixes up and completes the unaligned access, and restarts the program after the offending instruction.
Resource exhaustion. Resources are limited, and when a resource is exhausted, one or more stages of the pipeline must wait for a resource to become available again. The Store Queue, for example, is 32-entries long. If 32 store instructions are in flight, the 33rd store must stall until one of the first 32 store instructions completes and leaves the Store Queue. (The table above summarizes the most important capacities.)
Resource conflicts. Operations sometimes contend for the same resource. For example, some multimedia operations can only be performed in the upper unit of cluster zero in the integer execution unit. If the upper unit is occupied by a multimedia operation, then a second multimedia operation must wait for the first one to complete.
Arithmetic trap. An integer or floating point operation may cause a processor trap.

The severity of these conditions varies. The table below lists pipeline abort delays. This table tells only part of the story, however. The delay is only the time to clear and restart the pipeline. The delay does not include the time lost due to speculative results which were computed and then discarded. The delay does not include time spent in the operating system recovering from the ITB/DTB miss, arithmetic trap or unaligned memory access. Further, each penalty cycle is really one to six lost instruction issues.

Pipeline abort condition	Penalty (in cycles)
Branch misprediction	7
JSR misprediction	8
Memory order replay trap	14
Other memory replay traps	13
DTB miss	13
ITB miss	7
Integer arithmetic trap	12
Floating point arithmetic trap	13+(instruction latency)

Alpha 21264 pipeline abort delays.

3. DCPI and ProfileMe

The Compaq Continuous Profiling Infrastructure (DCPI) and ProfileMe collect and analyze program profile data. The profiles provide information about dynamic program behavior and can be used to detect and identify performance bottlenecks. ProfileMe is actually a part of DCPI, but is a necessary departure from the profiling facilities originally provided by DCPI. Although part of DCPI, we will refer to ProfileMe separately since it deals with different kinds of events.

Both DCPI and ProfileMe use a combination of hardware and software to collect and analyze performance data. The data is collected by sampling internal processor registers that count or monitor performance-related instruction events. The samples are written to a database and are organized by measurement epoch (the time period over which samples were collected), host processor, process and binary program image. Samples from the database are postprocessed by DCPI/ProfileMe tools. The tools analyze, summarize and present performance data by image, procedure (function), and instruction. The tools help a developer to find and drill down to hot spots and potential performance problems in a program.

Sampling has a number of advantages. First, sampling is non-invasive and is transparent. No instrumentation code needs to be compiled or inserted into a program to collect data. The application program can be run in the normal fashion and DCPI/ProfileMe collects data in the background. Overhead is low ranging from a 0.5% to 3.0% slowdown. Next, information can be collected about the entire system as it executes. DCPI/ProfileMe collects data about application programs, shared libraries, device drivers and operating system. Any component in the system is potentially a performance culprit. Finally, sampling and profiling provides performance information right down to the instruction level. This helps a programmer to identify code sequences or program behavior that are not in tune with the pipelined organization of Alpha processors or memory system. These insights can be used to change compiler options, optimization techniques, source code and algorithms to remove bottlenecks and achieve higher performance.

DCPI and ProfileMe use two different sampling techniques and the interpretation of some performance event data is different. DCPI was originally written to support in order processors such as the Alpha 21064 and the Alpha 21164. Program counter values and hardware performance registers are periodically sampled. Over time, event counts pile up, statistically, of course, on instructions. The "hottest" instructions have the highest event counts such as processor cycles, I- or D-cache misses, etc. The software tools are able to resolve certain timing problems to correctly associate event samples with instructions.

The following table shows the event types supported by both the Alpha 21064 and 21164. The 21064 and 21164 each support additional event types. The 21164, which is still in wide use, includes event types for ITB/DTB misses, L3 (B-) cache behavior, replay traps, and single-, triple- and quad-issue. Please see the dcpid(1) man page for further details.

Event type	Description
cycles	Processor cycles
issues	Instruction issues
nonissue	Non-issue cycles
imiss	Instruction cache misses
dmiss	Data cache misses
branchmp	Branch mispredicts
flow	Flow control changes
pipelinedry	Pipeline dry cycles
issue2	Cycles with 2 issues
intop	Integer operations (excluding loads/stores)
fpop	Floating point operations (excluding loads/stores)
load	Load instructions
store	Store instructions

Alpha 21064 and 21164 DCPI event types.

Things changed with the Alpha 21264. Out-of-order execution makes the exact order of instruction execution indeterminate. Measurements taken by sampling the program counter skew and smear making the assignment of event sample to instruction very difficult to impossible. ProfileMe was added, and instead, it uses instruction sampling to collect event data. At the fetch stage, an instruction is randomly selected for profiling. Information about execution events is collected as the instruction passes through the pipeline. This information is precise, it can be directly attributed to the program counter value and it provides a record of what happened to the instruction during its execution.

Two major types of 21264 processors have been manufactured to date: the 21264 and 21264A. The Alpha 21264 (EV6) incorporates program counter sampling. The Alpha 21264A (EV67) implements ProfileMe instruction sampling. The following table summarizes the events supported and sampled in the 21264. Due to smear and skew, samples cannot be ascribed to a particular instruction with certainty.

Event type	Description
cycles	Processor cycles
retires	Retired instructions
replaytrap	Memory system replay traps
itbmiss	Retired ITB misses
dtbmiss	Retired DTB misses
dtbdblmiss	Retired DTB double misses
retcondbr	Retired conditional branches
retunalign	Retired unaligned traps

Alpha 21264 (EV6) DCPI event types.

The table below summarizes the instruction events recorded by the Alpha 21264A. These events are recorded for each instruction that is selected for sampling. Since the program counter value for a sampled instruction is known, event attribution is precise. The taken and cbrmispredict event types are valid for conditional branch instructions only. The ldstorder event type is valid for only load and store instructions.

Event type	Description
retired	The instruction retired
valid	The instruction retired and did not cause a trap
taken	The conditional branch was taken
cbrmispredict	The conditional branch was mispredicted
nyp	Not yet prefetched
ldstorder	The instruction caused a load-store order replay trap
map_stall	The instruction stalled after fetch and before mapping
early_kill	The instruction was killed before it entered an issue queue
late_kill	The instruction was killed after it entered an issue queue

Alpha 21264A (EV67) ProfileMe event types.

ProfileMe also records and collects information about instruction traps. The data for each instruction includes a set of trap bits that indicate the cause of an instruction trap if one occurred during the attempted execution of the instruction. The table below lists the 21264A ProfileMe trap bits. Only one of these bits will be asserted (set) in a given ProfileMe sample.

Event type	Description
notrap	No trap occurred
mispredict	The instruction caused a mispredict trap
replays	The instruction caused a replay trap
unaligntrap	The instruction caused an unaligned load or store
dtbmiss	The instruction caused a DTB single miss
dtb2miss3	The instruction caused a DTB double miss (3 level page tables)
dtb2miss4	The instruction caused a DTB double miss (4 level page tables)
itbmiss	The instruction caused on instruction TLB miss
arithtrap	The instruction caused on arithmetic trap
fpdisabledtrap	The instruction caused a floating point disabled trap
dfaulttrap	The instruction caused a data stream page fault
iacvtrap	The instruction caused an instruction stream access violation
OPCDECtrap	The instruction attempted to execute a reserved instruction or privileged instruction
interrupt	The instruction was preempted by an interrupt

Alpha 21264A (EV67) ProfileMe trap bits.

ProfileMe support in the 21264A includes two programmable event counters. The table below is a list of ProfileMe events that can be measured using these counters. We will discuss all of these features in more detail later. Although the list of event types, trap bits and counter values is daunting, only a few data items are needed at any one time to answer a specific performance question.

Counter value name	Description
retdelay	Instruction retire delay
inflight	Number of cycles the instruction was inflight
bcmisses	Number of B-cache (L2) cache misses
replays	Number of memory system replay traps

Alpha 21264A (EV67) ProfileMe counter values.

4. Using DCPI and ProfileMe

This section discusses the "mechanics" of analyzing a program with DCPI/ProfileMe. We will introduce a few of the most important tools, concentrating on 21264A and ProfileMe, and show how to use them. We'll also take a quick tour of the DCPI database.

The table below lists the principal tools in the DCPI/ProfileMe suite. Be sure to consult the man pages as they contain many important details. The man page dcpid(1) contains a full list of event types supported by 21064, 21164 and 21264, and is a handy on-line resource. The man page dcpiprofileme(1) describes the use of DCPI to collect and analyze ProfileMe data on 21264A. It includes a description of ProfileMe events, counters and trap bits.

Program	Description
dcpid	DCPI daemon (starts measurement)
dcpiflush	Flushes samples to the DCPI database
dcpiquit	Stops the daemon (stops measurement)
dcpiepoch	Starts a new profiling epoch
dcpilabel	Assigns a label to the profiles for a command
dcpiprof	Summarize profile data by image or procedure
dcpilist	Display line-by-line profile data for a procedure
dcpiwhatcg	Where have all the cycles gone in an image?
dcpicalc	Calculate cycles-per-instruction for procedures
dcpitopstalls	List the instructions with the most stall cycles
dcpicat	Prints the contents of a profile file

DCPI/ProfileMe programs.

Performance analysis with DCPI/ProfileMe consists of five basic steps.

Collect data.
Break down the data by image to find the most active program components.
Break down the most active program components by procedure to find the most active procedures.
Display an instruction level analysis of the most active procedures.
Interpret the analysis, find potential issues, make a change and and repeat these steps.

This section will deal mainly with the first four steps. The next section (Section 5) will treat interpretation.

To illustrate these steps, we will use a short program taken from morphological image processing. The program applies a structuring element to each of the pixels in an image to dilate its features. The heart of the program consists of the two functions DilateImage and DilatePixel (shown below.) DilateImage steps across the image and calls DilatePixel to apply the structuring element to the pixels.

DilatePixel(r, c) int r, c ;

  {
  register int i, j ;
  int sum, the_max ;

  the_max = MINUS_INFINITY ;

  for (i = 0 ; i < 31 ; i++)
    {
    for (j = 0 ; j < 31 ; j++)
      {
      sum = A[r+i][c+j] + B[i][j] ;
      if (sum > the_max) the_max = sum ;
      }
    }

  Z[r][c] = the_max ;
  }

DilateImage()

  {
  register int r, c ;

  for (r = 0 ; r < IMAGE_ROWS ; r++)
    {
    for (c = 0 ; c < IMAGE_COLS ; c++)
      {
      DilatePixel(r, c) ;
      }
    }
  }

We will analyze the dilation program using DCPI/ProfileMe.

4.1 Collecting data

Five simple steps are needed to collect data:

Compile the application program and prepare the input data set.
Create a directory to hold the DCPI database if one does not exist already.
Start the DCPI daemon specifying the event data to be collected.
Run the application program.
Stop collection, flush sample data to the database and terminate the DCPI daemon.

Tru64 users should create a DCPI database directory during installation and create an initial scan map of program images. (Please see Appendix B for details.) DCPI/ProfileMe must be run by the superuser. The example session below illustrates the five simple steps for data collection.

1 > cc -O2 -arch ev6 -o dilate dilate.c
2 > setenv DCPIDB /dsk0h/dcpidb
3 > dcpid -event pm $DCPIDB
dcpid: load   : loaded pcount driver
dcpid: monitoring pm
dcpid: logging to /dsk0h/dcpidb/dcpid-positron.log
4 > dilate heart.ima ball out.ima
Dilation (dilate) untimed `spec' version
Image file (heart.ima) read.
Structuring element file (ball) read.
Output image file (out.ima) written.
Leaving program (dilate)
5 > dcpiquit
dcpictl: quit successful

The cc command at line 1 compiles the source code for the dilation program at optimization level 2. The program is compiled for the EV6 (Alpha 21264) target architecture. The command at line 2 creates an environment variable, DCPIDB, which allows us to more conveniently enter the path to the DCPI database directory. The database directory is located at /dsk0h/dcpidb. The dcpid command on line 3 starts the DCPI data collection daemon. The dcpid command offers the most options. The slot (event) specifier, -event pm, tells DCPI to collect ProfileMe data and to write profile data into the database directory given by the environment variable DCPIDB. The command on line 4 runs the application program. The command on line 5, dcpiquit, flushes any remaining samples to the database and stops the DCPI daemon.

In addition to the ProfileMe event types mentioned earlier, DCPI measures the instruction's retire delay and the approximate number of cycles that the instruction was in flight. ProfileMe calls these additional measurements "counters." ProfileMe supports four counter configurations as summarized in the table below.

Configuration	Counter 0	Counter 1
pm0	Retires	Inflight cycles
pm	Inflight cycles	Retire delay
pm2	Retires	B-cache (level 2 cache) misses
pm3	Inflight cycles	Memory system replay traps

ProfileMe counter configurations.

If you are interested in a particular measurement, you need to specify the appropriate counter configuration and collect the data. Otherwise, the analysis tools will not be able to summarize and display the desired quantity. The command:

dcpid -event pm2 $DCPIDB

for example, measures the number of retires and B-cache misses.

4.2 A quick tour of the database

Let's cd into the database directory and list the directory contents. The DCPI daemon produces several directories and files.

10 > cd $DCPIDB
11 > pwd
/dsk0h/dcpidb
12 > ls
200104041750/                       default_scanmap
PALcode@                            hosts
PALcode.positron.65558.131411*      index/
PALcode.positron.65558.131411.log   version
dcpid-positron.log

The log file, dcpid-positron.log, shows details about the measurement run. It includes CPU type and speed, process creation history, etc. dcpid provides a useful option, -logmaps, to write image information to the log file when an image is loaded. Here is an example line showing where the "dilate" program image was loaded:

dcpid: loadmap: loaded  : pid  3443 id 3acb5b2e00206526
  [0000000120000000..0000000120800000] [scan  ] (dilate)

The subdirectory with the long numeric name, 200104041750, represents a single measurement epoch (the period of time during which measurements were taken.) An epoch is created for each successful measurement run. A new epoch can also be started without terminating and restarting dcpid by executing the dcpiepoch command. The subdirectory name is the starting time (year, month, day, hour, minute) of the epoch.

An epoch directory contains data for each of the host CPUs (e.g., positron), one subdirectory per host. Within a host subdirectory are profile data files with long cryptic names such as "dilate_20010404173438206526." The contents of a profile data file can be listed using dcpicat:

dcpicat dilate_20010404173438206526

dcpicat displays the profile header information (shown below) followed by a list of address/event counts. Profile data is kept in a compact form. Address/count pairs are stored only for those pairs with a non-zero count, so addresses will probably not be contiguous.

name                 dilate
image                20010404173438206526
path                 dilate
epoch                200104041750
platform             positron
text_start           0x000120000000
text_size            0x000000004000

Examining profile data is not for the faint of heart and most users will not really care about the contents of profile files. dcpicat output, however, can be postprocessed and some users may want to write their own tools to do so.

4.3 Break down by image

The dcpiprof program breaks out event profile data by binary program image and by procedure within an image. Include the -i option on the command line to break out results by image. The command:

dcpiprof -pm retired -i

produces an image by image summary of retired instructions as shown below.

Column                Total  Period (for events)
------                -----  ------
retired:count        353866   63488

The numbers given below are the number of samples for each
listed event type or, for the ratio of two event types, the
ratio of the number of samples for the two event types.
===========================================================
retired                
 :count       %    cum% image
 313353  88.55%  88.55% /vmunix
  40119  11.34%  99.89% dilate
    106   0.03%  99.92% /usr/shlib/libXt.so
    105   0.03%  99.95% /dsk0h/dcpidb/PALcode
     75   0.02%  99.97% /usr/bin/dcpid
     28   0.01%  99.98% /usr/shlib/libc.so
     24   0.01%  99.98% /usr/shlib/X11/libdix.so
     14   0.00%  99.99% /usr/shlib/libX11.so
     11   0.00%  99.99% /usr/shlib/libXm.so
      8   0.00%  99.99% /usr/shlib/libmach.so
      6   0.00% 100.00% /usr/dt/bin/dtterm
      5   0.00% 100.00% /usr/shlib/X11/libxkb.so
      4   0.00% 100.00% unknown@positron
      2   0.00% 100.00% /usr/shlib/X11/libos.so
      2   0.00% 100.00% /usr/shlib/libpthread.so
      1   0.00% 100.00% /usr/shlib/X11/libmi.so
      1   0.00% 100.00% /bin/sh
      1   0.00% 100.00% /usr/shlib/X11/libcfb.so
      1   0.00% 100.00% /usr/bin/csh

This summary shows the number instruction retire events for each active program or library image during the measurement epoch. The dilation program, dilate, had 40119 instruction retire events. Note that the summary is system-wide including event counts for the Tru64 kernel (/vmunix), the operating system PALcode (/dsk0h/dcpidb/PALcode), the DCPI daemon (/usr/bin/dcpid), various runtime libraries (*.so), and shell programs (sh and csh.)

Hint: Although the kernel image, /vmunix, has many retire events, these are due to "idle time" processing. To reduce idle time, put the commands to start DCPI, run the application and stop DCPI in a single shell command line. We'll use this technique in later examples.

dcpiprof can produce a summary for any of the ProfileMe event types as specified by the argument to the -pm option in the dcpiprof command line. The commands:

dcpiprof -pm replay -i
dcpiprof -pm mispredict -i
dcpiprof -pm unaligntrap -i

will produce image by image summaries of replay traps, branch mispredicts and unaligned memory access traps, respectively. This is a quick way to find images with performance problems like alignment fix-ups.

dcpiprof will also produce image by image summaries for ProfileMe counter values. When we collected data for this example, we used the default counter configuration (-pm), which measures the number of cycles an instruction is inflight and instruction retire delay. The commands:

dcpiprof -pm retired:inflight -i
dcpiprof -pm retired:retdelay -i

produce summaries for the number of inflight cycles for retired instructions and the retire delay for retired instructions, respectively. Note that the counter name must be combined with a ProfileMe event to select instructions that satisfy some event criteria (e.g., retired, suffered a replay trap, etc.)

This syntax is sufficient to display results for individual event types and trap bits. ProfileMe supports a more flexible and powerful way to select event data of interest. Recall that each instruction which is sampled using ProfileMe has a set of event bits, trap bits, counter values and PC address associated with it. The ProfileMe software aggregates and organizes sample data using event bits, trap bits and program counter values.

The argument to -pm in the dcpiprof command is really an expression that selects a set of instructions/samples which satisfy a particular combination of event flags and trap bits. Two selection operators are supported:

"^" Logical AND of event/trap bits.
"!" Logical NOT of event/trap bit.

This, in effect, forms a query expression to be satisfied when searching the DCPI/ProfileMe database. dcpiprof will find, summarize and display data for all instructions that satisfy the selection expression. For example,

dcpiprof -pm retired^notrap -i

selects sample data for instructions that retired without trapping. The exclamation mark "!" takes the logical NOT of an event flag or trap bit as shown in this example:

dcpiprof -pm taken^\!mispredict -i

This command selects instructions where a branch was taken without a mispredict. The exclamation mark must be quoted on the command line to avoid misinterpretation by the shell. Backslash, "/", may be used in place of "!".

The selection expression may also include the "+" operator. The plus operation separates event/trap events and expressions. dcpiprof will select samples for each AND/NOT expression (or event type) and display it in a separate column. For example, the command:

dcpiprof -pm retired+replay -i

displays a table with two columns:

retired                 replays        
 :count       %    cum%  :count       % image
 313353  88.55%  88.55%      12  70.59% /vmunix
  40119  11.34%  99.89%       0   0.00% dilate
    106   0.03%  99.92%       0   0.00% /usr/shlib/libXt.so
    105   0.03%  99.95%       1   5.88% /dsk0h/dcpidb/PALcode
     75   0.02%  99.97%       1   5.88% /usr/bin/dcpid
     28   0.01%  99.98%       0   0.00% /usr/shlib/libc.so
     24   0.01%  99.98%       1   5.88% /usr/shlib/X11/libdix.so
     14   0.00%  99.99%       0   0.00% /usr/shlib/libX11.so
     11   0.00%  99.99%       1   5.88% /usr/shlib/libXm.so
      8   0.00%  99.99%       0   0.00% /usr/shlib/libmach.so
      6   0.00% 100.00%       1   5.88% /usr/dt/bin/dtterm
      5   0.00% 100.00%       0   0.00% /usr/shlib/X11/libxkb.so
      4   0.00% 100.00%       0   0.00% unknown@positron
      2   0.00% 100.00%       0   0.00% /usr/shlib/X11/libos.so
      2   0.00% 100.00%       0   0.00% /usr/shlib/libpthread.so
      1   0.00% 100.00%       0   0.00% /usr/shlib/X11/libmi.so
      1   0.00% 100.00%       0   0.00% /bin/sh
      1   0.00% 100.00%       0   0.00% /usr/shlib/X11/libcfb.so
      1   0.00% 100.00%       0   0.00% /usr/bin/csh

The next example command:

dcpiprof -pm retired+taken^mispredict -i

displays the following two column table:

                            taken^        
retired                 mispredict        
 :count       %    cum%     :count       % image
 313353  88.55%  88.55%          6  54.55% /vmunix
  40119  11.34%  99.89%          0   0.00% dilate
    106   0.03%  99.92%          0   0.00% /usr/shlib/libXt.so
    105   0.03%  99.95%          1   9.09% /dsk0h/dcpidb/PALcode
     75   0.02%  99.97%          1   9.09% /usr/bin/dcpid
     28   0.01%  99.98%          0   0.00% /usr/shlib/libc.so
     24   0.01%  99.98%          1   9.09% /usr/shlib/X11/libdix.so
     14   0.00%  99.99%          0   0.00% /usr/shlib/libX11.so
     11   0.00%  99.99%          1   9.09% /usr/shlib/libXm.so
      8   0.00%  99.99%          0   0.00% /usr/shlib/libmach.so
      6   0.00% 100.00%          0   0.00% /usr/dt/bin/dtterm
      5   0.00% 100.00%          0   0.00% /usr/shlib/X11/libxkb.so
      4   0.00% 100.00%          0   0.00% unknown@positron
      2   0.00% 100.00%          0   0.00% /usr/shlib/X11/libos.so
      2   0.00% 100.00%          0   0.00% /usr/shlib/libpthread.so
      1   0.00% 100.00%          1   9.09% /usr/shlib/X11/libmi.so
      1   0.00% 100.00%          0   0.00% /bin/sh
      1   0.00% 100.00%          0   0.00% /usr/shlib/X11/libcfb.so
      1   0.00% 100.00%          0   0.00% /usr/bin/csh

Use of the "+" sign is a bit unfortunate since it does not perform a logical OR between event/trap bits. It does let us create multi-column tables, however.

The ability to select arbitrary sets of instruction samples is quite general and powerful. Not all combinations of instruction event flags, trap bits and counter values are even meaningful or useful, however. Just because you can ask a question with dcpiprof doesn't mean that you should!

4.4 Break down by procedure

Based on the number of retired instructions, the program image dilate has the most relevant activity. Let's break down this image by procedure using dcpiprof:

dcpiprof -pm retire dilate

The only difference in the command line is the addition of the image name in place of the -i option. This command generates the table:

Column                Total  Period (for events)
------                -----  ------
retired:count         40119   63488

The numbers given below are the number of samples for each
listed event type or, for the ratio of two event types, the
ratio of the number of samples for the two event types.
===========================================================
retired                
 :count       %    cum% procedure                 image
  39848  99.32%  99.32% DilatePixel               dilate
    116   0.29%  99.61% WriteOutputImage          dilate
    111   0.28%  99.89% ReadInputImage            dilate
     35   0.09%  99.98% DilateImage               dilate
      9   0.02% 100.00% Initialize                dilate

We discussed event selection in the previous section (4.3.) All of those features can be used to produce procedure by procedure summaries of various event types, trap bits, etc.

4.5 Instruction level analysis

The procedure DilatePixel has the most retired instructions as expected since this function contains the inner most loops of the dilation program.

dcpilist is used to produce an instruction level analysis of a procedure. The command:

dcpilist -pm retire DilatePixel dilate

produces an annotated disassembly of DilatePixel. The -pm option works in the same way as dcpiprof. Each line of the disassembly is annotated with the selected event information.

As you can see below, the output from dcpilist can be quite lengthy. It's best to redirect the output from dcpilist to a file and then examine the file with a text editor. The part of the function with the most traffic (retired instructions) starts at address 0x1200022f0 and ends at 0x120002370. This code corresponds to the innermost loop of DilatePixel.

`DilatePixel` - dcpilist analysis

DilatePixel:
retired 
 :count freq   
      0    3        0x120002290   ldah    gp, 8192(t12)
      0    3        0x120002294   ldq_u   zero, 0(sp)
      0    3        0x120002298   lda     gp, 24112(gp)
      0    3        0x12000229c   ldq_u   zero, 0(sp)
      3    3        0x1200022a0   addl    zero, a0, a0
      3    3        0x1200022a4   addl    zero, a1, a1
      2    3        0x1200022a8   addq    a0, a0, v0
      3    3        0x1200022ac   ldq     t1, -32592(gp)
      3    3        0x1200022b0   ldq     t2, -32568(gp)
      1    3        0x1200022b4   lda     t0, -32767(zero)
      8    3        0x1200022b8   s8addq  v0, a0, v0
      3    3        0x1200022bc   addq    a1, a1, a1
      0    3        0x1200022c0   addq    v0, v0, v0
      0    3        0x1200022c4   s8subq  v0, a0, v0
      2    3        0x1200022c8   lda     t1, 0(t1)
      7    3        0x1200022cc   bis     zero, zero, a0
      6    3        0x1200022d0   lda     t2, 0(t2)
      7    3        0x1200022d4   s4addq  v0, t1, t1
      5    3        0x1200022d8   ldq_u   zero, 0(sp)
      2    3        0x1200022dc   ldq_u   zero, 0(sp)
    124  124        0x1200022e0 : addq    t1, a1, t3
    124  124        0x1200022e4   bis     zero, zero, t4
    107  124        0x1200022e8   bis     zero, t2, t5
    136  124        0x1200022ec   ldq_u   zero, 0(sp)
    888  897        0x1200022f0 : ldwu    t7, 0(t3)
    911  897        0x1200022f4   ldwu    a3, 0(t5)
    932  897        0x1200022f8   addl    t4, 0x4, t4
    821  897        0x1200022fc   lda     t3, 8(t3)
    909  897        0x120002300   ldwu    a4, -6(t3)
    936  897        0x120002304   ldwu    a5, 2(t5)
    891  897        0x120002308   lda     t5, 8(t5)
    922  897        0x12000230c   ldwu    t8, -4(t3)
    905  897        0x120002310   ldwu    t9, -4(t5)
    953  897        0x120002314   sextw   t7, t7
    904  897        0x120002318   sextw   a3, a3
    863  897        0x12000231c   sextw   a4, a4
    916  897        0x120002320   sextw   a5, a5
    857  897        0x120002324   addq    t7, a3, t7
    912  897        0x120002328   ldwu    a3, -2(t3)
    903  897        0x12000232c   addq    a4, a5, a4
    897  897        0x120002330   cmplt   t0, t7, a5
    882  897        0x120002334   sextw   t8, t8
    876  897        0x120002338   sextw   t9, t9
   1883  897        0x12000233c   cmovne  a5, t7, t0
    861  897        0x120002340   addq    t8, t9, t8
    887  897        0x120002344   ldwu    t9, -2(t5)
    890  897        0x120002348   sextw   a3, a3
    861  897        0x12000234c   cmplt   t0, a4, t10
   1786  897        0x120002350   cmovne  t10, a4, t0
    894  897        0x120002354   sextw   t9, t9
    900  897        0x120002358   cmplt   t4, 0x1c, a4
    880  897        0x12000235c   addq    a3, t9, a3
    896  897        0x120002360   cmplt   t0, t8, t7
   1791  897        0x120002364   cmovne  t7, t8, t0
    922  897        0x120002368   cmplt   t0, a3, a5
   1803  897        0x12000236c   cmovne  a5, a3, t0
    883  897        0x120002370   bne     a4, 0x1200022f0
    150  124        0x120002374   ldq_u   zero, 0(sp)
    114  124        0x120002378   ldq_u   zero, 0(sp)
    109  124        0x12000237c   ldq_u   zero, 0(sp)
    413  388        0x120002380 : ldwu    t7, 0(t3)
    395  388        0x120002384   ldwu    t9, 0(t5)
    371  388        0x120002388   addl    t4, 0x1, t4
    377  388        0x12000238c   lda     t3, 2(t3)
    398  388        0x120002390   cmplt   t4, 0x1f, a2
    384  388        0x120002394   lda     t5, 2(t5)
    436  388        0x120002398   sextw   t7, t7
    389  388        0x12000239c   sextw   t9, t9
    402  388        0x1200023a0   addq    t7, t9, t7
    377  388        0x1200023a4   cmplt   t0, t7, t6
    715  388        0x1200023a8   cmovne  t6, t7, t0
    395  388        0x1200023ac   bne     a2, 0x120002380
    117  124        0x1200023b0   addl    a0, 0x1, a0
    126  124        0x1200023b4   lda     t1, 1084(t1)
    138  124        0x1200023b8   cmplt   a0, 0x1f, a5
    120  124        0x1200023bc   lda     t2, 62(t2)
    134  124        0x1200023c0   bne     a5, 0x1200022e0
      2    3        0x1200023c4   ldq     a4, -32576(gp)
      8    3        0x1200023c8   lda     a4, 0(a4)
      4    3        0x1200023cc   s4addq  v0, a4, v0
      3    3        0x1200023d0   addq    v0, a1, v0
      6    3        0x1200023d4   stw     t0, 0(v0)
      4    3        0x1200023d8   ret     zero, (ra), 1
      0    0        0x1200023dc   ldq_u   zero, 0(sp)

5. Investigating program performance

Interpretation of DCPI/ProfileMe results is the most difficult part of the performance analysis. Superscalar processors in general rely upon effective use of the processing pipeline and the memory hierarchy. Analysis relies upon an understanding of the memory system and pipeline behavior. Behavior of a superscalar pipeline is complex, especially in an out-of-order execution machine like the Alpha 21264, and one of many possible factors may cause a bottleneck. Section 2 of this paper is a brief introduction to Alpha 21264 pipeline behavior. Another resource for information on superscalar computing and Alpha programming is the paper Performance tips for Alpha Linux C programmers. Although a few of the tips in that paper are Linux-specific, many of the techniques apply to Alpha programs generically. The bibliography contains a number of references on the Alpha 21264 design which may also be helpful.

This section is a collection of possible performance problems. We'll describe DCPI/ProfileMe measurements that may indicate the presence of a particular problem. Determination of the exact cause generally involves a close examination of the algorithm, source code and annotated disassembly. It's often important to understand the dynamic behavior of the program especially the way in which the program accesses data in memory. On the Alpha 21264, the immediate instruction execution history (up to 80 instructions) is sometimes relevant since later instructions can be affected by the scheduling/execution of earlier instructions in the pipeline.

5.1 A broad indicator of performance.

A broad indicator of performance is the ratio of retired instructions to raw cycles. This ratio is a rough measure of the degree of instruction level parallelism achieved during program execution. Higher values indicate that more work is being carried out in parallel.

The ratio is easily computed by measuring both retires and cycles, then dividing retires by cycles:

61 > dcpid -slot cycles -slot pm $DCPIDB ; dilate heart.ima ball out.ima ; dcpiquit
dcpid: load   : loaded pcount driver
dcpid: monitoring cycles
dcpid: monitoring pm
dcpid: logging to /dsk0h/dcpidb/dcpid-positron.log
Dilation (dilate) untimed `spec' version
Image file (heart.ima) read.
Structuring element file (ball) read.
Output image file (out.ima) written.
Leaving program (dilate)
dcpictl: quit successful
62 > dcpiprof -i -pm retire -event cycles
Column                Total  Period (for events)
------                -----  ------
cycles                 9212  126976
retired:count         20439  126976

The numbers given below are the number of samples for each
listed event type or, for the ratio of two event types, the
ratio of the number of samples for the two event types.
===========================================================
                       retired        
cycles       %    cum%  :count       % image
  8361  90.76%  90.76%   19536  95.58% dilate
   724   7.86%  98.62%     816   3.99% /vmunix
    50   0.54%  99.16%      39   0.19% /usr/bin/dcpid

The ratio of retired instructions to cycles in this example is (19536/8361) = 2.336. The dcpid command above "multiplexes" or shares the DCPI/ProfileMe hardware counters between the measurement of CPU cycles and ProfileMe events. (Yep, you can do that.) The example also demonstrates how measured "idle" time in the kernel can be minimized by combining commands into a single line, reducing the amount of "dead" time due to typing.

The Alpha 21264 is able to retire up to eleven instructions in a single cycle. It can sustain a rate of eight retires per cycle for short periods of time. This peak figure is further limited in real code by the availability (or shortage) of physical integer and floating point registers. In integer code with D-cache references, a maximum of 3.3 retires per cycle could be achieved, but is quite unlikely in practice. In floating point code with D-cache references, the maximum is 2.3 floating point instructions per cycle, again, a maximum that is unlikely to be achieved in real code. Thus, a retire/cycle ratio of 3 in integer code is exceptionally good. If the ratio is in the neighborhood of one retire per cycle, then further improvement is possible and investigation is warranted.

5.2 Branch mispredicts.

As described in Section 2, the Alpha 21264 fetches instructions aggressively. It uses branch prediction to predict the direction of conditional branches and then fetches instructions based upon the predicted direction. The effectiveness of speculative instruction fetch and execution depends upon the ability of the branch predictor to make predictions correctly.

The ProfileMe cbrmispredict event bit is set when a (sampled) conditional branch instruction is mispredicted. For a given conditional branch instruction, the ratio of its branch mispredictions to its retires is its misprediction rate. The misprediction rate is more helpful than the raw number of mispredictions since it takes into account the number of times that the instruction was fully executed to retirement.

The following example is taken from measurements made on the dilation program. The dcpilist command illustrates another feature supported in the -pm, the ability to compute a ratio. The "::" operator tells dcpilist to compute the ratio of cbrmispredict and the number of retires.

37 > dcpilist -pm cbrmispredict::retired DilatePixel dilate
DilatePixel:
cbrmispredict 
       :count 
           :: 
      retired 
       :count freq   
           NA    3        0x120002290   ldah    gp, 8192(t12)
           NA    3        0x120002294   ldq_u   zero, 0(sp)
...
       0.0000  897        0x120002370   bne     a4, 0x1200022f0
...
       0.0000  388        0x1200023ac   bne     a2, 0x120002380
...
       0.0224  124        0x1200023c0   bne     a5, 0x1200022e0
...

There are three conditional branch instructions in DilatePixel. The number of sampled mispredicts for these instructions are 0, 0 and 3, respectively. The branch at address 0x120002370 is at the bottom of the innermost loop. No mispredicts were measured indicating that the misprediction rate is quite low, which is good, of course.

No branch mispredictions were detected because none of the sampled executions of the conditional branch at 0x120002370 mispredicted. This is not the same as saying "none of the executed branches was mispredicted." Since ProfileMe is based on a sampling technique, it's possible that the execution time of the program was too short to detect even a single misprediction. In general, program runtime should be long enough to collect a statistically significant set of samples (one minute or longer.) Alternatively, a short program could be run multiple times within the same DCPI epoch to increase the amount of data collected.

The taken ProfileMe event indicates that the sampled branch was taken. This event can be used to determine the most frequent direction (taken or not taken) for a branch. If the ratio of taken events to retires for a particular branch instruction is greater than 50%, then the taken direction is the most frequent.

The mispredict trap bit is set for any conditional branch, subroutine jump (JSR), subroutine return (RET), unconditional jump (JMP) and jump to coroutine (JMP_COROUTINE) that suffered a misprediction trap. This bit indicates the failure of both I-cache line and way (direction) prediction. (The subject of line prediction is beyond the scope of this paper. See the IEEE Micro paper by Kessler.) Line and way predictors are correct 85% to 100% of the time for real world applications. Misprediction rates higher than 15% may indicate a problem.

If branch mispredictions are a problem, how can they be eliminated or reduced in number? Generally, the 21264 favors straight-line code -- long sequences of instructions that are uninterrupted by branches. Briefly, some suggestions are:

Ask the compiler to unroll loops making loop bodies longer and increasing the number of instructions between branches.
For similar reasons, ask the compiler to expand functions in-line. Be aware, though, that in-lining can sometimes have a negative effect on instruction cache (I-cache) behavior making performance worse.
Reduce the number of branches by forcing the compiler to use conditional move instructions instead of branches.
If possible, choose an algorithm or write code such that decisions or branches will very likely go in the same direction. This will help the branch predictor learn from the branch behavior.

These suggestions are sometimes difficult to implement in practice because higher level language programmers have only indirect control over the code produced by the compiler. Improvement is an iterative process of making a change, compiling, observing the code produced by the compiler, taking measurements, and then trying another change.

Here are a few fine points about branch prediction. The 21264 uses a tournament branch prediction scheme. It adaptively chooses between a global predictor and a local predictor to make the call and predict branch direction. The choice, global and local predictors are table driven with 4,096, 4,096 and 1,024 entries each, respectively. The choice and global prediction tables are indexed by the most recent global path history (the taken/not taken state of the twelve most recent branches) and the local prediction table is indexed by bits 11:2 of the virtual program counter value of the current branch.

The 21264 fetches four instructions at a time. These four instructions are aligned on a 16-byte boundary. This aligned group of four instructions is called a "quadpack." The branch and I-cache line predictors perform best when there is only one conditional branch per quadpack. If branch density cannot be decreased through the techniques mentioned above, critical branches can be spaced out among quadpacks by inserting "no-op" instructions. There are three no-op instructions:

Mnemonic	Disassembly	Purpose
`UNOP`	`LDQ_U R31,0(Rx)`	Universal noop
`NOP`	`BIS R31,R31,R31`	Integer noop
`FNOP`	`CPYS F31,F31,F31`	Floating noop

Alpha no-op instructions.

These instructions should not encounter operand issue delay, destination issue delay and functional unit delay.

Since the 21264 fetches an entire quadpack at a time, fetch efficiency is maximized when all four instructions execute. If, for example, control enters a quadpack in its middle or at the bottom, then not all instructions in the quadpack will execute and the processor is effectively fetching fewer than four instructions per clock cycle. Fetch efficiency is improved by:

Aligning branch and jump targets to a quadpack boundary such that control enters the top of the quadpack, and
Putting branches out of the quadpack at its end such that control leaves the quadpack at or near the bottom.

Generally, higher level language programmers are at the mercy of the compiler with respect to this aspect of quadpacking. Compilers implement quadpacking, but the programmer may need to specify the target architecture at compile time (e.g., -arch ev6) to enable this optimization.

5.3 Replay traps.

The 21264 detects and handles hazards that may produce incorrect results when instructions are issued and executed out of program order. Some of these hazards are removed before instruction issue as in the case of certain register dependencies. Some hazards, however, cannot be detected until an instruction has issued and has begun execution. Memory system hazards, for example, cannot be detected until load and store instructions have issued and their effective addresses have been evaluated. Such dynamic hazards are eliminated by replaying the instruction stream from an architecturally correct and consistent state. The mechanism for replaying the instruction stream is called the "replay trap."

Section 2.6 of Performance Tips for Alpha Linux C programmers discusses memory system replay traps in some detail. There are five main kinds of 21264 memory system replay traps:

A load-store replay trap occurs when a load, which consumes the value from a store to the same address, has issued before the store. The store is allowed to complete before restarting execution from the load.
A load-load replay trap occurs when two loads from the same address issue out of program order. The loads are forced to complete in program order.
A load-miss load replay occurs when a load misses and it is followed by another load from the same address. The first load is allowed to complete before executing the second load.
A wrong size replay trap is performed when a store is followed by a load that reads the same data and the load data type is larger than the store. The store is allowed to complete before the load is issued.
A cache line contention replay trap occurs when loads and stores that are in progress at the same time and that map to the same cache line (32Kb spacing/stride.) The processor replays the instruction that triggered the replay trap until the earlier memory operations have completed.

The cost of a replay trap depends upon how often the replay trap occurs, the nature of the replay trap itself, and the number of instructions that must be aborted (discarded.) The aborted instructions collectively are called the replay trap "shadow" and the length of the shadow is the number of aborted instructions.

Two broad ProfileMe metrics can be used to identify program components that are taking an unacceptable number of replay traps. The ratio of trapped instructions to retired instructions:

!notrap:count::retired:count

is an indicator of trap frequency (X traps per Y retires.) The ratio of non-retired (aborted) instructions to retired instructions:

!retired:count::retired:count

indicates the ratio of the amount of work discarded to the amount of work completed (X aborts per Y retires.) These ratios can be computed and displayed by dcpiprof to produce an image-by-image summary. Ideally, both ratios will be small for each image. A trap to retire ratio of 0.02 or greater and/or an abort to retire ratio of 0.25 or greater may indicate the presence of a replay-related performance problem. Once the problem has been narrowed to a particular image, dcpiprof can display the number of replay events per procedure to identify particular procedures for detailed, instruction-level analysis.

Here is a simple program, siz_replays.c, that is designed to cause a large number of wrong size replay traps.

#define LOOP_COUNT (16)
#define ARRAY_SIZE (16*1024*1024)

long int mem_loc[ARRAY_SIZE] ;

long int write_part_read_all(int *pA, long int *pB)
{
  *pA = 35 ;
  return( *pB ) ;
}

main()
{
  long int i, j, z ;
  for (i = 0 ; i < LOOP_COUNT ; i++) {
    for (j = 0 ; j < ARRAY_SIZE ; j++) {
      // Write the first four bytes at mem_loc[j]
      // then read all eight bytes causing a wrong
      // size replay trap
      z = write_part_read_all((int *)&mem_loc[j], &mem_loc[j]) ;
    }
  }
}

The function write_part_read_all takes two arguments: a pointer to a 32-bit integer and a pointer to a 64-bit long integer. The assignment statement writes 35 to the 4 bytes in memory as referenced by the first pointer. The function then reads and returns the 8 byte value at the location referenced by the second pointer. The function main repeatedly calls write_part_read_all and passes in the same actual pointer value for both the first and second pointer parameter. The pointer arguments are aliases for the same base memory address although the pointers refer to data items of different size.

Let's examine the assembler language code for write_part_read_all. We obtained the assembler language code by compiling the program with the Compaq C compiler using the -Wf,-S option:

ccc -Wf,-S -o siz_replays siz_replays.c

This produces the file siz_replays.s. The assembler code for the body of write_part_read_all is:

 #      7 {
 #      8   *pA = 35 ;

	mov	35, $1         # 000008
	stl	$1, ($16)

 #      9   return( *pB ) ;

	ldq	$0, ($17)      # 000009

 #     10 }

	ret	($26)          # 000010

When write_part_read_all is called, R16 and R17 will contain the first and second pointer arguments to the function. The code moves 35 into register R1 ($1) and stores the value of R1 into the 32-bit longword at the address specified by R16 ($16.) The store is immediately followed by a load from the 64-bit quadword at the address specified by R17 ($17.) Because main passes in a pointer to the same memory location for the first and second arguments, the store (stl) and load (ldq) will refer to the same memory address. However, the store instruction will write to the first four bytes ABCD as shown in the diagram below,

while the load instruction will read all eight bytes ABCDEFGH. Attempting this sequence of operations should trigger a wrong size replay trap.

The next step is to run the program and collect DCPI/ProfileMe data. Here are the shell commands that we used to collect DCPI/ProfileMe data:

setenv DCPIDB /dsk0h/dcpidb
dcpid -event pm $DCPIDB ; siz_replays ; dcpiquit

The execution time for siz_replays is a bit short (about 23.6 seconds), but that should be long enough to capture some replays.

Hint: Use the Tru64 UNIX time command to measure program execution time. If a program has a short execution time, run the program several times during a DCPI measurement epoch (period.) See the man page for dcpiepoch to get even more ideas about using measurement epoches.

The following example demonstrates how to produce an image-by-image summary of the abort to retire ratio and the ratio of discarded (non-retired) instructions to retired instructions. The command line illustrates a few short-cuts. Notice that it was not necessary to specify ::count following the ProfileMe event names as ::count is assumed. Backslash was used to denote logical NOT instead of the exclamation mark, which must be quoted to avoid misinterpretation by the shell. The event name retired could have been abbreviated as ret, further shortening the command line.

dcpiprof -i -pm trap::retired+/retired::retired
...
!notrap !retired
 :count   :count
     ::       ::
retired  retired
 :count   :count image
 0.2503   5.1905 siz_replays
 0.1250   1.5000 /usr/shlib/libpthread.so
 0.1000   0.0000 /usr/shlib/libmach.so
 0.0714   0.0714 /usr/shlib/libXt.so
 0.0484   0.3385 /dsk0h/dcpidb/PALcode
 0.0296   0.7757 /vmunix
 0.0132   0.5658 /usr/bin/dcpid

The ratio of trapped to retired instructions for siz_replays is quite large; roughly one instruction suffered a replay trap for every four instructions that successfully retired. The ratio of discarded to productive instructions (5.1905) is also very high. Five instructions were discarded for every retired instruction -- highly inefficient. Ratios are shown for other components (runtime libraries, the DCPI daemon, and the kernel), but realistically as an ordinary programmer, we can only tune the application siz_replays.

The next step is to identify the procedures with the most replay traps within a particular program image. Here is the command that we used to identify write_part_read_all as the culprit.

dcpiprof -pm replays siz_replays
...
replays                
 :count       %    cum% procedure                 image
  11669 100.00% 100.00% write_part_read_all       siz_replays

We knew this to be so by design, but isolation to a particular procedure is an essential step in replay trap analysis.

The final step is to use dcpilist to disassemble and display the culprit procedure.

dcpilist -pm retired+replays write_part_read_all siz_replays
write_part_read_all:
retired replays 
 :count  :count freq   
   4257       0 4208        0x120001190   bis     zero, 0x23, t0
   4135       0 4208        0x120001194   stl     t0, 0(a0)
   4223   11669 4208        0x120001198   ldq     v0, 0(a1)
   4220       0 4208        0x12000119c   ret     zero, (ra), 1

Here, the wrong size replay traps have accumulated on the load instruction. The register names a0 and a1 are synonyms for R16 ($16) and R17 ($17.)

Before leaving this example, let's rewrite the program to read a full 64-bit quadword instead of a 32-bit longword. This should eliminate the wrong size replay traps. The changes produce the following C code (occ_replays.c.)

#define LOOP_COUNT (16)
#define ARRAY_SIZE (16*1024*1024)

long int mem_loc[ARRAY_SIZE] ;

long int write_read(long int *pA, long int *pB)
{
  *pA = 35 ;
  return( *pB ) ;
}

main()
{
  long int i, j, z ;
  for (i = 0 ; i < LOOP_COUNT ; i++) {
    for (j = 0 ; j < ARRAY_SIZE ; j++) {
      // Read from a memory location (mem_loc[j]) after
      // writing to it causing occasional load-store replay
      // traps. Alpha 21264 "learns" not to execute the
      // load ahead of the store, so a replay will occur every
      // 64k cycles when the "learn" bit is reset.
      z = write_read(&mem_loc[j], &mem_loc[j]) ;
    }
  }
}

We renamed the first function to write_read and changed the type of the first parameter such that both formal parameters are now pointers to long integer. When main calls write_read, it passes in the same address as the value of the first and second actual parameter. The store followed by a load from the same address should trigger a load-store replay trap.

The modified program, occ_replays, executes in about 4.5 seconds, considerably faster than siz_replays. ProfileMe records load-store order traps, specifically, via ldstorder events. Running dcpilist produces the following transcript.

dcpilist -pm ldstorder write_read occ_replays
write_read:
ldstorder 
   :count 
        0      0x120001180   bis     zero, 0x23, t0
        1      0x120001184   stq     t0, 0(a0)
        0      0x120001188   ldq     v0, 0(a1)
        0      0x12000118c   ret     zero, (ra), 1

One ldstorder event was recorded for the store instruction. Remember that ProfileMe is a sampling process that chooses instructions at random for profiling, so certainly more than one load-store order replay trap occurred during execution. Only one such event was recorded because the frequency of occurrence was relatively small with respect to the frequency of wrong size replay traps in siz_replays.

Why did occ_replays run faster and why was the frequency of load-store order replay traps relatively small? The designers of the Alpha 21264 felt that load-store order traps would occur frequently enough in real programs to warrant preventive action. The first time that the 21264 detects the store-load hazard, it sets a bit in a load wait table. On subsequent executions, the load instruction is issued after the store instruction thus avoiding a costly replay trap. The wait table is periodically cleared (every 32K cycles) to avoid unnecessary waits. In the example above, the 21264 has "learned" to issue the store before the load. Load-store order replay traps still occur whenever the wait table is cleared, but the frequency of occurence is greatly diminished.

But, wait. There's more. The dcpiprof summary shows that 696 replay events were tallied against the program occ_replays.

dcpiprof -i -pm replays
...
replays                
 :count       %    cum% image
    694  94.04%  94.04% occ_replays
     43   5.83%  99.86% /vmunix
      1   0.14% 100.00% /usr/shlib/libc.so

Where are the other replays? The following dcpilist transcript shows the distribution of replay events in the function write_read.

dcpilist -pm replays write_read occ_replays
write_read:
replays 
 :count 
      0      0x120001180   bis     zero, 0x23, t0
    645      0x120001184   stq     t0, 0(a0)
     49      0x120001188   ldq     v0, 0(a1)
      0      0x12000118c   ret     zero, (ra), 1

One of the replay events on the store instruction is the ldstorder event. The other replay traps are due to the short, memory-intensive nature of the loop. The Alpha 21264 has a load queue and a store queue to hold pending load and store instructions, respectively. The queues let the processor core run ahead of the relatively slow memory system. When a load (store) operation arrives at a full queue, the load (store) instruction is replayed. The instruction proceeds when an empty queue entry becomes available. The load and store instructions in write_read are within a very short loop with little CPU work to be performed while the memory system is completing read and write operations. The load and store queues become full and the instructions are replayed.

The program siz_replays also takes some load-store replay traps, which can be exposed using dcpiprof and dcpilist. Short, memory-intensive loops in particular involve complex interactions between CPU and memory units, making analysis difficult.

Clearly, the difficult part of the analysis is determining why the replay traps occurred. The chief and common condition in load-store, load-load, load-miss and wrong size replay traps is that two or more memory instructions are attempting to access the same memory location within a shared issue-execution neighborhood. A load or store instruction taking replay traps must be interfering with the execution of some other nearby load or store instruction (or vice versa.) You should first relate the instructions back to the source code from which they came. Then, think about access to memory data structures within the local neighborhood of the culprit instruction in the source code.

Can array and other data structure accesses map to the same address? If so, does this happen frequently?
Do two or more data structures overlap in memory? Will the access pattern into those data structures frequently map into the same address at the same time?
Does the program manipulate fields within a memory word and then read/write the word as a whole? Is data treated in a polymorphic way (union datatypes) that can lead to wrong size replay traps?

Cache line contention replay traps are similar, but instead of identical addresses, conflict occurs between accesses to the same cache line. This is likely to happen when data structures are placed on 32Kbyte address boundaries or when programs walk through memory with a stride of 32Kbytes.

The elimination of memory system replay traps may require a rethinking of algorithm, placement of data structures in memory, size/padding of data structures in memory, indexing and access pattern. Alpha programmers, in general, must given careful consideration to the memory-related behavior of their programs. Here are some suggestions for avoiding replay traps.

If a program writes to a variable in memory and then immediately reads from the same variable, replay traps may result. It would be better to write the value into memory and to retain a copy in a separate temporary variable, for which the compiler can allocate a register.
A program should not walk through arrays using a stride of 32Kbytes. Replays aside, cache memory design favors sequential, stride one array access.
When loops have a common stride through two or more arrays, the arrays should be padded to assure that the first and subsequent accesses do not fall on the same cache line.
Avoid code that mixes the manipulation of individual fields/bytes within a memory word with full word reads/writes of the same memory word.

The Alpha 21264 provides instructions to move data between integer and floating point registers. Earlier Alpha implementations did not have these instructions, forcing the transfer through a common memory location on the stack and thereby increasing the possibility of load-store replays when run on an Alpha 21264 host. Recompiling for the Alpha 21264 (

-arch
ev6

) may help if the program was compiled for an earlier Alpha implementation.

The Compaq C compiler tries to avoid replay traps when compiling code for the Alpha 21264. The compiler will analyze pointers and the placement of data structures in memory in order to make certain optimizations. In general, a compiler has only a limited ability to determine if two data structures occupy the same area in memory or if two pointers refer to the same data item. If the compiler cannot make this determination with certainty, it must generate conservative code which is functionally correct under all circumstances. Conservative code will likely have more explicit memory access operations and consequently more opportunity for memory system replay traps. Further, the compiler cannot adjust the scheduling of instructions (that is, how instructions are placed in program order) to increase the spatial and temporal distance between memory accesses to the same data items. Modern compilers can, in general, do a better job of code generation and optimization when the programmer is explicit about data structures and the program's access into them.

5.4 Misses in the memory system.

Memory hierarchy provides a cost efficient trade-off between memory size and memory speed. It also compensates for the speed imbalance between the high clock and processing rate of the CPU and the much slower speed of bulk memory devices like large capacity RAM and disk drives. Effective use of memory hierarchy depends upon the exploitation of spatial and temporal locality -- having data where it is expected when it is expected. This usually means having the next data item to be used close, in space and time, to the most recently used data item.

There are four key levels in the memory hierarchy of a typical uniprocessor system based on the Alpha 21264.

Level 1 instruction and data caches (L1 I-cache and D-cache.)
Level 2 instruction and data cache (L2 cache, sometimes called "B-cache.")
Primary memory (organized by virtual and physical memory pages.)
Backing store for virtual memory pages.

Backing store uses disk drives and the I/O system to keep copies of virtual memory pages when they are not resident in primary memory. We will not treat backing store and I/O here, but we need to be aware of its presence because non-resident virtual memory pages reside there and access time to a non-resident page is quite long. Discussion will focus mainly on the first three levels.

The L1 I-cache and D-cache are implemented on the same chip as the processor core. They provide the fastest access to instructions and data, but are relatively small (64Kbytes each.) The processor looks in these caches first for instructions and data. If needed data or instructions are not found there, the processor looks to the L2 cache. The L2 cache is a unified cache meaning that it contains both instructions and data, unlike L1 where separate caches are dedicated to instructions and data. The L2 cache is off-chip and is a bit larger, 1 to 16 Mbytes depending upon the system implementation. The XP-1000 workstation, for example, has a 1Mbyte L2 cache. If needed data or instructions are not found in L2 cache, then the processor looks in primary memory as governed by the mapping of the virtual (program) effective address to physical effective address. The data or instructions are returned to the caches and CPU if available within a physical memory page. If the page is non-resident, it must be brought into primary memory from backing store, the data returned, and the program restarted.

Every transition between levels costs time and performance. Thus, it is to the program's benefit to keep instructions and data ready for access in the fastest part of the memory hierarchy. ProfileMe provides event types to assist the location of memory system performance problems. We'll discuss these events starting with primary memory and working in to the central processor.

5.4.1 Virtual memory pages

A page fault occurs when memory access needs data/instructions from a virtual memory page and the page is not resident in memory. The dfaulttrap ProfileMe event indicates that the instruction caused a data stream fault because the page was inaccessible or because the virtual address was malformed (e.g., not properly sign extended.) Data stream faults are easy to locate -- just capture a ProfileMe database and use dcpiprof/dcpilist to narrow the location of dfaulttrap events to images, procedures and instructions.

To avoid data stream faults, a program should not skip from page to page and should stay within a virtual memory page as long as possible. Page faults have a very high penalty since the program must be suspended until the needed page can be read from disk. This is a delay of many millions of processor cycles.

5.4.2 L2 cache memory

The Level 2 (L2) cache provides a large, relatively fast cache memory between primary memory and the on-chip L1 cache. The L2 cache is sometimes called the "B-cache" or "board-level cache." Future Alpha processors may implement the L2 cache on-chip, so we prefer the term "L2" when referring to the second level cache. As mentioned earlier, the L2 cache contains both instructions and data. L2 cache lines and items are referenced by physical address. The L2 cache is direct mapped meaning that a given cache line holds information for exactly one address-to-block association (64bytes per block) and that a line is quickly selected/indexed by a subfield within a pending address. If two data or instruction streams contend for the same L2 cache line(s) within a critical inner loop, the cache will thrash as the older entry is continually over-written by information from the newer access. The large size of the L2 cache reduces the probability of thrashing, but the possibility does exist. Of course, thrashing results in poor program performance.

We will use a simple program, walk, to illustrate misses in the memory system, including the L2 cache. Here is the source code for the program walk, which walks through a large array and sums the elements of the array. The array walking loop is executed repeatedly to increase the execution time of the program and to capture a larger number of miss events.


#define ARRAY_SIZE (16*1024*1024)
#define LOOP_COUNT (256)

long int array[ARRAY_SIZE] ;

main()
{
  int i, j ;
  long int sum ;

  sum = 0 ;

  for (i = 0 ; i < LOOP_COUNT ; i++) {
    for (j = 0 ; j < ARRAY_SIZE ; j++) {
      sum = sum + array[j] ;
    }
  }
}

We ran the program and collected data under Tru64 on an XP-1000 workstation. The results are shown in the examples below.

DCPI/ProfileMe provides four predefined counter configurations that can be used to measure four performance-related events: number of retires, inflight cycles, B-cache misses, and memory system replay traps. Each counter configuration measures two of these four event types. The argument to the -event option in the dcpid command selects the counter configuration as well as enabling the collection of ProfileMe instruction events. The default configuration (-pm) counts inflight cycles and retire delay, so we used the configuration -pm2 to count retires and B-cache misses.

14 > dcpid -event pm2 $DCPIDB ; walk ; dcpiquit

After collecting data, we used dcpiprof to break down B-cache misses by image and procedure.

15 > dcpiprof -i -pm ret:bcmisses
...
  retired                
:bcmisses       %    cum% image
   369463  99.58%  99.58% walk
     1249   0.34%  99.92% /vmunix
      268   0.07%  99.99% /dsk0h/dcpidb/PALcode
       11   0.00%  99.99% /usr/shlib/libpthread.so
        7   0.00%  99.99% unknown@positron
        7   0.00%  99.99% /usr/shlib/X11/libdix.so
        5   0.00% 100.00% /usr/shlib/X11/libos.so
        5   0.00% 100.00% /usr/shlib/libc.so
        4   0.00% 100.00% /usr/shlib/libXt.so
        3   0.00% 100.00% /usr/sbin/xntpd
        3   0.00% 100.00% /usr/bin/dcpid
        1   0.00% 100.00% /sbin/loader
16 > dcpiprof -pm ret:bcmisses walk
...
  retired                
:bcmisses       %    cum% procedure                 image
   369463 100.00% 100.00% main                      walk

The B-cache misses have occurred in the procedure main as expected since it contains the inner program loop.

As usual, we used dcpilist to display an instruction-by-instruction breakdown of events. In this case, we invoked dcpilist on main in the program image walk.

39 > dcpilist -pm ret+ret:bcmisses main walk
main:
retired   retired 
 :count :bcmisses  freq   
      0         0     0        0x120001150   ldah    gp, 8192(t12)
      0         0     0        0x120001154   ldq_u   zero, 0(sp)
      0         0     0        0x120001158   lda     gp, 28432(gp)
      0         0     0        0x12000115c   ldq_u   zero, 0(sp)
      0         0     0        0x120001160   bis     zero, zero, t0
      0         0     0        0x120001164   bis     zero, zero, t1
      0         0     0        0x120001168   ldah    t2, 256(zero)
      0         0     0        0x12000116c   ldq     t3, -32624(gp)
      0         0     0        0x120001170 : bis     zero, zero, t4
      0         0     0        0x120001174   bis     zero, t3, t5
      0         0     0        0x120001178   ldq_u   zero, 0(sp)
      0         0     0        0x12000117c   ldq_u   zero, 0(sp)
  67561     58616 67759        0x120001180 : ldq     t7, 0(t5)
  67868     58798 67759        0x120001184   addl    t4, 0x1, t4
  67912     58964 67759        0x120001188   lda     t5, 8(t5)
  67945     58938 67759        0x12000118c   cmplt   t4, t2, a0
  67760     67200 67759        0x120001190   addq    t0, t7, t0
  67510     66947 67759        0x120001194   bne     a0, 0x120001180
      0         0     0        0x120001198   addl    t1, 0x1, t1
      0         0     0        0x12000119c   lda     a1, -256(t1)
      0         0     0        0x1200011a0   blt     a1, 0x120001170
      0         0     0        0x1200011a4   bis     zero, zero, v0
      0         0     0        0x1200011a8   ret     zero, (ra), 1
      0         0     0        0x1200011ac   call_pal halt

The inner loop is clearly visible where retire and B-cache miss events have accumulated. The 21264 B-cache miss counter is a "free running" counter and does not associate a B-cache miss with a particular instruction via instruction sampling. This counter behavior plus the rather short, memory-intensive nature of the loop, smears the B-cache miss events across the instructions in the loop. This makes the detective work of miss analysis a little more difficult.

Good cache behavior is essential to good program performance. A program should work within a small neighborhood of a few data structures, complete its processing there, and then move on to the next neighborhood. Memory data is more likely to be available in cache and to be reused if it was used recently or if its immediate neighbor within the same cache block was used recently. Similar principles hold for program instructions.

When walking an array, use a single stride, sequential access pattern. This kind of behavior exhibits good spatial and temporal locality. C stores array elements in row major order while FORTRAN lays out array elements in column major order. In C, the two-dimensional array:

a₀₀ a₀₁ a₀₂
a₁₀ a₁₁ a₁₂
a₂₀ a₂₁ a₂₂

is arranged in memory in the following way:

a₀₀ a₀₁ a₀₂ a₁₀ a₁₁ a₁₂ a₂₀ a₂₁ a₂₂

from lowest to highest address. The rightmost index should change fastest to obtain sequential access. In FORTRAN, the same array has the arrangement:

a₀₀ a₁₀ a₂₀ a₀₁ a₁₁ a₂₁ a₀₂ a₁₂ a₂₂

again, from lowest to highest address. Changing the leftmost index fastest yields sequential access. Programmers who are porting software from FORTRAN to C should interchange indices where possible.

We'll discuss and analyze cache behavior again in later sections. However, before leaving discussion of the L2 cache, we must note that data prefetching is an important technique that improves cache memory performance. Prefetching starts the transfer of data into cache before it is needed immediately for computation. Section 5.4.4 illustrates the use and benefits of data prefetching by inserting prefetches into the example program walk. Prefetches improve the performance of both L1 and L2 cache.

5.4.3 Virtual to physical address mapping

The off-chip memory system uses physical addresses. The Alpha 21264 maps virtual (program or logical) addresses to physical addresses before communicating off-chip memory requests. Unless the operating system exploits contiguous page mapping, virtual memory is subdivided into 8Kbyte pages. The virtual memory pages belonging to a process (program) are described in a page table. Part of this description is where the physical representation of each page can be found in primary memory or on the backing store. This information controls the mapping of virtual addresses to physical, primary memory addresses.

The 21264 uses a hardware structure called a "translation look-aside buffer" to assist the mapping of virtual addresses. The 21264 has two look-aside buffers: the Data Translation Buffer (DTB) and the Instruction Translation Buffer (ITB.) The DTB and ITB each contain 128 entries, where each entry describes the mapping of one virtual memory page. The DTB and ITB are, effectively, cache memories that hold the most recently used address translation information. The operating system loads the DTB and ITB with information from the page table of the running process.

Since the DTB and ITB have 128 entries and pages are 8Kbytes in size, a full TLB can describe at most 1Mbyte of the entire virtual memory space. When translation information for a virtual address is not found in the appropriate TLB, the operating system is called to put the needed information into the TLB, overwriting one of the entries. This call to the operating system is time consuming and frequent TLB misses should be avoided.

ProfileMe provides a way to measure DTB and ITB misses. The excerpt below captures ProfileMe data for walk, displays an image-by-image summary of DTB misses, and identifies the procedure main in walk as the location of the DTB misses. A similar process can be used to isolate ITB misses using the ProfileMe itbmiss trap name instead of dtbmiss.

11 > dcpid -event pm $DCPIDB ; walk ; dcpiquit
dcpid: load   : loaded pcount driver
dcpid: monitoring pm
dcpid: logging to /dsk0h/dcpidb/dcpid-positron.log
dcpictl: quit successful
12 > dcpiprof -i -pm dtbmiss
...
dtbmiss                
 :count       %    cum% image
     58  87.88%  87.88% walk
      5   7.58%  95.45% /vmunix
      2   3.03%  98.48% /dsk0h/dcpidb/PALcode
      1   1.52% 100.00% /usr/dt/bin/dtterm
13 > dcpiprof -pm dtbmiss walk
...
dtbmiss                
 :count       %    cum% procedure                 image
     58 100.00% 100.00% main                      walk

The next excerpt shows the annotated disassembly for main.

14 > dcpilist -pm retired+dtbmiss main walk
main:
retired dtbmiss 
 :count  :count  freq   
      0       0     0        0x120001150   ldah    gp, 8192(t12)
      0       0     0        0x120001154   ldq_u   zero, 0(sp)
      0       0     0        0x120001158   lda     gp, 28432(gp)
      0       0     0        0x12000115c   ldq_u   zero, 0(sp)
      0       0     0        0x120001160   bis     zero, zero, t0
      0       0     0        0x120001164   bis     zero, zero, t1
      0       0     0        0x120001168   ldah    t2, 256(zero)
      0       0     0        0x12000116c   ldq     t3, -32624(gp)
      0       0     0        0x120001170 : bis     zero, zero, t4
      0       0     0        0x120001174   bis     zero, t3, t5
      0       0     0        0x120001178   ldq_u   zero, 0(sp)
      0       0     0        0x12000117c   ldq_u   zero, 0(sp)
  67840      58 67658        0x120001180 : ldq     t7, 0(t5)
  67749       0 67658        0x120001184   addl    t4, 0x1, t4
  67802       0 67658        0x120001188   lda     t5, 8(t5)
  67500       0 67658        0x12000118c   cmplt   t4, t2, a0
  67611       0 67658        0x120001190   addq    t0, t7, t0
  67450       0 67658        0x120001194   bne     a0, 0x120001180
      0       0     0        0x120001198   addl    t1, 0x1, t1
      0       0     0        0x12000119c   lda     a1, -256(t1)
      0       0     0        0x1200011a0   blt     a1, 0x120001170
      0       0     0        0x1200011a4   bis     zero, zero, v0
      0       0     0        0x1200011a8   ret     zero, (ra), 1
      0       0     0        0x1200011ac   call_pal halt

The DTB miss events accumulate on the load instruction at 0x120001180 that reads the value of array[j] from memory.

The principles for reducing DTB misses and achieving better performance are similar to those for virtual memory page faults and caches. Access should be confined to a small section of memory until work is complete and then the program should move on to the next section. For data, this means planning array access carefully. Access into large data heaps should also be managed to avoid randomly skipping around in a heap causing frequent replacement of DTB entries. For instructions, procedures that call each other frequently should be on the same page or at least nearby in memory. Infrequently used procedures, like error handlers, should be located further away in memory to make room for more frequently called procedures.

5.4.4 L1 data cache behavior

The 21264 L1 data cache (D-cache) is on-chip and provides fast access to program data when the data is resident in the cache. Integer and floating point load instructions produce register results in 3 and 4 processor cycles, respectively. The D-cache holds 64Kbytes of memory data and is organized as a two-way set associative cache with 64 byte blocks. This means that the D-cache holds two address-to-block associations and that it can tolerate two data streams accessing addresses that map to the same cache index without replacing one of the address-to-block associations. The direct map approach in the L2 cache does not have this tolerance although the larger size of the L2 cache tends to mitigate the effect of line conflicts.

ProfileMe does not provide an event to directly measure D-cache misses. Instead, we can use retire delay to detect and infer the presence of D-cache misses and other memory system delays. Retire delay is a lower bound on the number of cycles that an instruction has delayed its retirement and the retirement of other instructions following it in program order. As we'll see in the example below, retire delay tends to accumulate on the instruction that consumes the result of a load instruction. A relatively high retire delay sometimes indicates the presence of a performance bottleneck.

Retire delay is captured in the usual way as shown below. If you have used the -pm option with dcpid, then you already have retire delay in the database and do not need to rerun dcpid.

14 > dcpid -event pm $DCPIDB ; walk ; dcpiquit
dcpid: load   : loaded pcount driver
dcpid: monitoring pm
dcpid: logging to /dsk0h/dcpidb/dcpid-positron.log
dcpictl: quit successful

The -pm option uses the default ProfileMe counter configuration. This configuration also captures inflight, which is approximately the number of cycles that an instruction was inflight.

The dcpiprof program will display an image-by-image summary of retire delay and will also break down retire delay by procedure.

15 > dcpiprof -i -pm ret:retdelay
...
  retired                
:retdelay       %    cum% image
  1215336  97.98%  97.98% walk
    18548   1.50%  99.47% /vmunix
     6086   0.49%  99.96% /dsk0h/dcpidb/PALcode
      250   0.02%  99.98% /usr/bin/dcpid
       74   0.01%  99.99% /usr/shlib/libc.so
       49   0.00%  99.99% /usr/shlib/libXt.so
       40   0.00% 100.00% /usr/shlib/libX11.so
       28   0.00% 100.00% /usr/shlib/libpthread.so
        6   0.00% 100.00% /var/subsys/pcount.mod
        5   0.00% 100.00% /usr/shlib/libmach.so
        5   0.00% 100.00% /usr/shlib/X11/libos.so
        5   0.00% 100.00% unknown@positron
        4   0.00% 100.00% /usr/sbin/routed
        3   0.00% 100.00% /usr/shlib/X11/libdix.so
        2   0.00% 100.00% /usr/shlib/libevm.so
        1   0.00% 100.00% /usr/shlib/libcfg.so
16 > dcpiprof -pm ret:retdelay walk
...
  retired                
:retdelay       %    cum% procedure                 image
  1215336 100.00% 100.00% main                      walk

Although these breakdowns identify the program images and procedures with the most retire delay, retire delay is more meaningful at the instruction level. The following example shows the disassembly of the procedure main in walk.

17 > dcpilist -pm ret+ret:retdelay main walk
main:
retired   retired 
 :count :retdelay  freq   
      0         0     0        0x120001150   ldah    gp, 8192(t12)
      0         0     0        0x120001154   ldq_u   zero, 0(sp)
      0         0     0        0x120001158   lda     gp, 28432(gp)
      0         0     0        0x12000115c   ldq_u   zero, 0(sp)
      0         0     0        0x120001160   bis     zero, zero, t0
      0         0     0        0x120001164   bis     zero, zero, t1
      0         0     0        0x120001168   ldah    t2, 256(zero)
      0         0     0        0x12000116c   ldq     t3, -32624(gp)
      0         0     0        0x120001170 : bis     zero, zero, t4
      0         0     0        0x120001174   bis     zero, t3, t5
      0         0     0        0x120001178   ldq_u   zero, 0(sp)
      0         0     0        0x12000117c   ldq_u   zero, 0(sp)
  67840      3590 67658        0x120001180 : ldq     t7, 0(t5)
  67749         0 67658        0x120001184   addl    t4, 0x1, t4
  67802         0 67658        0x120001188   lda     t5, 8(t5)
  67500         0 67658        0x12000118c   cmplt   t4, t2, a0
  67611   1144229 67658        0x120001190   addq    t0, t7, t0
  67450     67517 67658        0x120001194   bne     a0, 0x120001180
      0         0     0        0x120001198   addl    t1, 0x1, t1
      0         0     0        0x12000119c   lda     a1, -256(t1)
      0         0     0        0x1200011a0   blt     a1, 0x120001170
      0         0     0        0x1200011a4   bis     zero, zero, v0
      0         0     0        0x1200011a8   ret     zero, (ra), 1
      0         0     0        0x1200011ac   call_pal halt

The load (ldq) instruction at 0x120001180 reads the array element array[j] from memory. After the value is read, it is stored in register t7. Although memory system misses are caused by the load, the retire delay accumulates on the add instruction (addq) that consumes the result of the load in register t7. Thus, the general procedure for analyzing misses is to identify the instruction with the highest retire delay and then work back in the program to find any load instructions that produce the register arguments to the delayed instruction. The load instructions are the likely culprits.

As an experiment, let's manually insert a prefetch instruction into the inner loop of walk to improve D-cache behavior. The Compaq C compiler supports a built-in function, asm, which can be used to insert assembler language instructions into a program. Here is the source for the modified program, pwalk.

#include <c_asm.h>

#define ARRAY_SIZE (16*1024*1024)
#define LOOP_COUNT (256)
#define PREFETCH   8

long int array[ARRAY_SIZE+PREFETCH] ;

main()
{
  int i, j ;
  long int sum ;

  sum = 0 ;

  for (i = 0 ; i < LOOP_COUNT ; i++) {
    for (j = 0 ; j < ARRAY_SIZE ; j++) {
      asm("ldl $31,($16)", &array[j+PREFETCH]) ;
      sum = sum + array[j] ;
    }
  }
}

The program first includes c_asm.h to define asm. The body of the loop remains the same except, a call to asm:

asm("ldl $31,($16)", &array[j+PREFETCH]) ;

has been inserted. The call passes in the address of the array element that is one cache line ahead. The address is made available to the prefetch instruction in register R16 ($16.) Thus, the prefetch instruction will begin the process of accessing the array elements to be used from the next cache line while the array elements in the current cache line are being read and summed.

Write hints can also be inserted into a program using asm:
asm(".arch ev6;wh64 ($16)", address)
A write hint tells the memory system that a cache block will be completely overwritten. The memory system clears the entry for this block in the cache without reading its data from memory. For this reason, write hint must be used carefully; its behavior is destructive.

We used the Tru64 time command to measure the execution time of the original walk program (no prefetch) and pwalk (prefetch inserted.) The walk program executes in 49.68 seconds and pwalk executes in 36.71 seconds. We used DCPI/ProfileMe to measure the total number of cycle samples within the inner loops of walk and pwalk.

Measurement	`walk`	`pwalk`
Execution time	117.44 seconds	58.93 seconds
Total cycles in inner loop	1,223,128	615,265

Execution statistics for `walk` and `pwalk`.

The ratio of inner loop cycles (1,223,128 / 615,265 = 1.98) is roughly the same as the ratio of program execution times (117.44 / 58.93 = 1.99.)

The ProfileMe measure inflight is approximately the number of instructions that an instruction is inflight. We measured and compared the inflight cycles for the inner loop of walk:

retired     valid
 :count :inflight  freq
...
  67840   8085612 67658        0x120001180 : ldq     t7, 0(t5)
  67749   8078173 67658        0x120001184   addl    t4, 0x1, t4
  67802   8098298 67658        0x120001188   lda     t5, 8(t5)
  67500   8037100 67658        0x12000118c   cmplt   t4, t2, a0
  67611   9160497 67658        0x120001190   addq    t0, t7, t0
  67450   9204279 67658        0x120001194   bne     a0, 0x120001180

and the inner loop of pwalk:

  68161   4296917 67735        0x120001190 : ldl     zero, 0(t5)
  67484   4252675 67735        0x120001194   ldq     t7, -64(t5)
  67990   4284110 67735        0x120001198   addl    t4, 0x1, t4
  67698   4263218 67735        0x12000119c   lda     t5, 8(t5)
  67578   3760276 67735        0x1200011a0   cmplt   t4, t3, a0
  67579   4299603 67735        0x1200011a4   addq    t0, t7, t0
  67658   4374610 67735        0x1200011a8   bne     a0, 0x120001190

The number of inflight cycles for each instruction has been dramatically reduced in pwalk. Retire delay is also reduced.

As in the case of virtual memory pages and B-cache behavior, D-cache performance is best when a program works within a small neighborhood of memory (32Kbytes or less on the Alpha 21264), completes processing within the neighborhood, and then moves on to the next one. Prefetching is a way to tell the memory system that a certain block of memory will be used next and that it should be brought into the D-cache. This will make that block of memory available in D-cache when it is actually read or written.

5.4.5 L1 instruction cache behavior

The L1 instruction cache (I-cache) provides quick access to the most recently used program instructions. The CPU looks here first for instructions before requesting them from the L2 cache, or if necessary, primary memory. The 21264 needs a steady, predictable stream of instructions to deliver high performance. The L1 I-cache quickly delivers instructions to the 21264 fetch unit in order to keep the rest of the processor pipeline busy. Like D-cache, performance suffers when I-cache miss rates are high.

The ProfileMe event nyp (Not Yet Prefetched) indicates that when the 21264 fetch unit requested a block containing an instruction, the instruction was not in the I-cache and that an off-chip request for the instruction had not yet been initiated. If this event is asserted, then the instruction fetch operation definitely caused an I-cache miss stall. There are conditions under which an I-cache miss can still occur and nyp will not be asserted. Thus, nyp tends to understate the occurrence of I-cache miss stalls.

Generally, I-cache behavior is best when the most frequently executed repetitive code such as an inner loop or group of closely interrelated subroutines are fully contained in the instruction cache. This minimizes the likelihood of I-cache collisions between critical stretches of code. Code placement using a high level language is indirect at best. But, interrelated, frequently executed functions could be grouped together in a source module as the compiler will lay the functions down in order in memory. The two-way set associative organization of the Alpha 21264 I-cache can tolerate two-way conflicts. Earlier Alpha implementations with direct mapped I-cache are more sensitive to collisions.

I-cache behavior is also sensitive to procedure in-lining. Benefit can be gained by expanding in-line a frequently called procedure, such as a function that is called from the body of a critical inner loop. Procedure in-lining reduces call overhead, reduces branch mispredicts, and increases opportunities for optimization between the call site and the body of the procedure itself. Expansion often increases overall code size, perhaps making a critical inner loop so big that it no longer fits in the I-cache resulting in poor I-cache behavior and performance. Thus, in-lining should be used with caution -- measure test runs with and without procedure in-lining to see which is best.

5.5 Unaligned data access

The Alpha instruction set architecture (ISA) favors the use of aligned memory items. An aligned data item can be read or written with a single load or store instruction. Unaligned memory accesses usually require several instructions to update a data item that straddles an aligned address boundary.

C compilers for Alpha assume aligned memory access. The compiler aligns data structures and generates small, fast code. Sometimes, however, programmers make bad assumptions about data alignment, type casting or pointer arithmetic. Or perhaps a program was written for a specific (non-Alpha) architecture and it embodies architecture-specific assumptions about data alignment. When run on Alpha, the program may use pointer/address values that are not aligned according to the Alpha ISA. These unaligned memory accesses cause alignment traps, which are directed to the operating system. The operating system fixes up the instruction, completes the memory read or write operation, and returns control to the program. These "alignment fix-ups" are costly and can rob program performance.

We will illustrate the effect of unaligned memory accesses and traps with the simple program below (unalign.) The program deliberately causes alignment traps by computing an unaligned address and then reading a 32-bit word at that address.

unsigned char two_64bit_words[16] ;

unsigned int f(unsigned int *pInt)
{
  return( *pInt ) ;
}

main()
{
  int i ;
  unsigned char *pChar ;
  unsigned int *pInt ;
  unsigned int  vInt ;

  pChar = two_64bit_words ;

  for (i = 0 ; i < 8 ; i++) two_64bit_words[i] = 0xFF ;
  for (i = 8 ; i < 16 ; i++) two_64bit_words[i] = 0x0 ;

  for (i = 0 ; i < 80000000 ; i++) {
    pInt = (unsigned int *)(pChar+6) ;
    vInt = f(pInt) ;
  }

  printf("pInt is %8x\n", pInt) ;
  printf("vInt is %8x\n", vInt) ;
}

The function f takes a pointer to an unsigned integer and returns the value of the integer. When f is compiled, it produces a single instruction (ldl) to read the value of the unsigned, 32-bit integer from memory (alignment assumed.)

The array two_64bit_words is allocated on a quadword aligned address boundary. The quadword aligned base address of the array is stored in the pointer variable pChar. The function main calls f with a pointer into two_64bit_words that has been forced to an unaligned address:

pInt = (unsigned int *)(pChar+6) ;

The address arithmetic and cast to a different pointer type should start alarm bells ringing since this assignment effectively goes around C language type enforcement. We chose an offset of six bytes to force the unsigned integer across a word aligned address as illustrated below. The "ABCD" markers denote the four bytes in the unsigned integer to be read. Aligned 32-bit integer boundaries are at indices 0, 4, 8 and 12, where an aligned access could be performed.

0	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15
						A	B	C	D

The session excerpt below runs the program unalign and captures ProfileMe data. The dcpiprof commands narrow the occurrence of unaligntrap events to the function f in the program unalign.

58 > uac p noprint
59 > dcpid -event pm $DCPIDB ; unalign ; dcpiquit
dcpid: load   : loaded pcount driver
dcpid: monitoring pm
dcpid: logging to /dsk0h/dcpidb/dcpid-positron.log
pInt is 400001a6
vInt is     ffff
dcpictl: quit successful
60 > dcpiprof -i -pm unaligntrap
...
unaligntrap                
     :count       %    cum% image
       1362 100.00% 100.00% unalign
61 > dcpiprof -pm unaligntrap unalign
...
unaligntrap                
     :count       %    cum% procedure                 image
       1362 100.00% 100.00% f                         unalign

The following dcpilist excerpt displays an annotated disassembly for f showing the number of retire events and unaligntrap events.

62 > dcpilist -pm ret+unaligntrap f unalign
f:
retired unaligntrap 
 :count      :count freq   
      0        1362  615        0x1200011b0   ldl     v0, 0(a0)
   1230           0  615        0x1200011b4   ret     zero, (ra), 1
      0           0    0        0x1200011b8   ldq_u   zero, 0(sp)
      0           0    0        0x1200011bc   ldq_u   zero, 0(sp)

The load instruction (ldl) reads the 32-bit integer at the memory location specified by the address in register a0, which holds the pointer argument to the function f. The load instruction is the site of many unaligned address traps since we forced the pointer to be an unaligned address. None of the load instructions retired because the operating system was called to fix-up and complete the trapped instruction on behalf of the hardware and the program.

Hint: If you try this example on Tru64 UNIX, be sure to enter the uac p noprint command. By default, Tru64 does not let alignment faults go by silently. This command disables alignment error messages, and without it, your display screen will fill with error messages. See the man page on uac for more details.

The Tru64 time command shows that unalign consumes 6.32 execution seconds of user (program) time and 29.35 seconds of operating system time:

63 > time unalign
pInt is 400001a6
vInt is     ffff
6.32u 29.35s 0:35 99% 0+1k 0+1io 0pf+0w

The program is spending over three times as much time in the operating system as it is spending in the program itself. The inefficiency is even worse than it appears because time in the operating system PALcode is accumulated in user time. The image-by-image summary of retired instructions tells the tale:

64 > dcpiprof -i -pm retire
...
retired                
 :count       %    cum% image
 310682  72.57%  72.57% /vmunix
 108538  25.35%  97.92% /dsk0h/dcpidb/PALcode
   8754   2.04%  99.96% unalign
     86   0.02%  99.98% /usr/bin/dcpid

Only 8,754 application instructions were retired while nearly 419,220 instructions were spent on alignment fix-ups.

The Compaq and GNU C compilers can be directed to assume the use of unaligned data and to generate conservative Alpha code that will not cause alignment traps. This quick fix may, at first, appear attractive. However, alignment traps usually indicate a conceptual error somewhere in the program and they should instead be treated as a bug and should be properly corrected and removed. ProfileMe identifies the offending load/store instructions, which can easily be related back to the source code for further investigation.

5.6 Resource exhaustion

Effective exploitation of instruction level parallelism (ILP) depends upon the availability of internal computational resources. Resources in the Alpha 21264 pipeline include multiple integer function units, multiple floating point function units, load queue, store queue, etc. In order to size these resources, the chip architects simulated and analyzed real world programs and made trade-offs between program performance and the number and size of on-chip resources. Although the Alpha 21264 is able to support a high degree of ILP, the number and capacity of its computational resources are limited. If a particular stretch of code (temporarily) uses up a particular resource and another instruction arrives and makes its request for the resource, then that instruction must wait for the resource to become free and available again before it may proceed. Sometimes this wait results in a pipeline stall and lost processor cycles.

The 21264 uses queues to balance processing rates between certain pipeline stages and central processing units.

The integer instruction queue and the floating point instruction queue hold instructions that have been mapped and are ready to be selected for issue.
The load queue and the store queue hold pending memory read and write operations to be completed by the memory unit.

The queues allow independent operation of the producing and consuming units to proceed concurrently. A queue fills when the producer is inserting new items into the queue at a higher rate than the consumer is removing items. When a queue becomes full the producer must wait until the consumer has removed an item before it can proceed again.

The load and store queues fill when a large number of read/write operations have been initiated and the memory unit cannot keep up with the requests. This is most likely to occur in code that is load/store intensive. Again, the solution is to thin out or reduce the density of load/store operations. Bursts of load/store activity should be separated by computation, allowing the memory unit to work off the backlog of memory requests before making new requests. Data prefetching will also help as data will be available in faster L1 D-cache and the memory unit will be able to process memory requests at a much higher rate.

The instruction queues may become full when the fetch and mapping units have produced more instructions than can issue into the integer and floating point execution units. The mapping stage stalls when an instruction queue is full since it does not have a place to put newly mapped instructions. This may be due to poor instruction scheduling and/or a lack of availability of function units. Certain kinds of operations must be performed in a particular function unit. For example, MIN/MAX and PACK/UNPACK MVI operations must be performed in unit zero of the upper integer subcluster, integer multiply must be performed in unit one of the upper integer subcluster and floating point multiply must be executed in the floating point multiplier. If a large number of instructions needing a particular unit occur within a small neighborhood of the program, then contention for the unit, and waiting instructions, will result. Conceivably, if the number of waiting instructions is large enough, the instruction queue will fill. A corrective action is to "thin out" instructions with specialized needs, separating them by instructions that do not create additional contention for the function unit in demand. Unfortunately, a higher level language programmer may only be able to indirectly affect the instructions laid down by the compiler by rearranging source code.

The mapping stage may also stall due to a shortage of physical registers. Physical registers hold intermediate results and are allocated from a pool of integer registers and a pool of floating point registers. A physical register is allocated to an instruction when the instruction is mapped and placed into the instruction queue. The register remains allocated while the instruction is in flight. The physical register is released when its instruction is retired. Of the 80 physical integer registers, 33 registers are available to hold the results of instructions that are inflight. Of 72 physical floating point registers, 37 registers are are available to hold the results of inflight instructions. The mapping stage of the pipeline will stall:

If 33 instructions that require an integer physical register have been mapped and are inflight, and an additional integer physical register is requested.
If 37 instructions that require a floating point physical register have been mapped and are inflight, and an additional floating-point physical register is requested.

The peak sustainable rates of physical register allocation are 3.3 integer registers per cycle and 2.3 floating point registers per cycle, assuming a mix of load, store and operate instructions and hits in the D-cache.

See Compiler Writer's Guide for the Alpha 21264 for details on register availability and allocation.

The mapping stage has one other dynamically allocated resource: the instruction number or "inum." An instruction number is assigned to each instruction and is used for bookkeeping during execution. There are 80 inums available. The mapping stage may stall due to a shortage of instruction numbers, but in practice, this is unlikely.

The ProfileMe map_stall event is asserted when an instruction stalls after it was fetched and before it was mapped. Map stalls can be isolated to particular images and procedures in the usual way (dcpiprof and dcpilist.) Stalls at the mapping stage can be reduced through better code scheduling to minimize the amount of time that instructions remain inflight. Universal no-op instructions (UNOP) can be inserted into the code to reduce fetch bandwidth, thereby slowing down the pipeline stages immediately preceding the mapper. Unfortunately, a programmer working in a higher level language can only indirectly the code schedule by rearranging the source program.

5.7 Floobydust

The Alpha 21264 and ProfileMe support other event types and trap bits. This section provides a brief description of them.

"Floobydust" is a contemporary term derived from archaic Latin miscellaneous, whose disputed history probably springs from Greek origins (influenced, of course, by Egyptian linguists) -- meaning here "a mixed bag." From Audio Handbook, © 1976 by National Semiconductor Corporation.

The early_kill and late_kill events, when asserted, specify that an instruction was killed early in the pipeline (before issue) or late in the pipeline (after issue.) Early and late kill give some indication of the cost associated with a killed instruction. Instructions killed after issue have already consumed resources and may have been partially executed. Thus, late kills are more expensive than early kills.

The trap bits include exception conditions that occur very infrequently, or not at all in a well-behaved program. These trap bits are:

arithtrap: The instruction caused an arithmetic trap.
fpdisabledtrap: An attempt was made to execute a floating point instruction while floating point instructions were disabled.
MT_FPCRtrap: The floating point control register (FPCR) is emulated on the Alpha 21264. This trap is taken when execution of an MT_FPCR (move to FPCR) instruction is attempted.
iacvtrap: The instruction caused an I-stream access violation trap. An illegal access to the instruction was attempted as specified by the current processor mode and page protection.
OPCDECtrap: An attempt was made to execute a reserved instruction or execute a privileged instruction in user mode.
interrupt: The instruction was preempted by an interrupt.
mchktrap: The instruction caused a machine check trap. Machine checks occur when the hardware detects a hardware error condition.

With the exception of interrupt, these bits indicate the presence of a serious error condition or bug.

Conclusion

We provided a brief introduction to the use of DCPI/ProfileMe to identify performance problems on Alpha, concentrating on the Alpha 21264A processor. ProfileMe and the 21264A use a new approach to performance analysis called "instruction sampling." Although instruction sampling differs from program counter sampling, it provides precise information about instruction-level program behavior.

Acknowledgements

Internal notes and presentations by Mark Vandevoorde, Mike Burrows, Kip Walker, Steve Root, Jeff Dean, Jamey Hicks, Mark Davis, Michael Bedy and Phil Ezolt were essential to the preparation of this paper. Thanks!

Special thanks go to Sharon Smith whose comments vastly improved this paper, especially the introduction. Thanks also go to Mark Davis for his careful review and for contributing replay examples. His contributions made the replay section much stronger.

References

The first three papers below describe the inner workings and application of ProfileMe and DCPI. The third paper is an analysis of the Compaq ES40 system, which is based on the 21264. ProfileMe was used in the analysis of the ES40.

ProfileMe: hardware support for instruction-level profiling on out-of-order processors, Dean, J., Hicks, J.E., Waldspurger, C.A., Weihl, W.E., and Chrysos, G., Proceedings Thirtieth Annual IEEE/ACM International Symposium on Microarchitecture, December 1997, pages 292 - 302.

Continuous Profiling: Where have all the cycles gone?, J. Anderson, L.M. Berc, J. Dean, S. Ghemawat, M.R. Henzinger, S.-T. Leung, R.L. Sites, M.T. Vandevoorde, C.A. Waldspurger, and W.E. Weihl, SRC Technical Note 1997-016a, Also published in Proceedings of the 16th Symposium on Operating System Principles and ACM Transactions on Computer Systems, Volume 15, Number 4, November 1997, pages 357-390.

Transparent, Low-Overhead Profiling on Modern Processors, Workshop on Profile and Feedback-Directed Compilation, Paris, France, 13 October 1998.

Performance analysis of the Alpha 21264-based Compaq ES40 system, Cvetanovic, Z. and Kessler, R.E., Proceedings of the 27th International Symposium on Computer Architecture, 2000, June 10-15 2000, pages 192 - 202.

The following papers are "higher-level" introductions to the Alpha 21264 processor. They provide more detail on the inner workings of the 21264 chip.

The Alpha 21264 microprocessor, Kessler, R.E., IEEE Micro, March-April 1999, pages 24 - 36.

The Alpha 21264 microprocessor architecture, Kessler, R.E., McLellan, E.J., and Webb, D.A., Proceedings. International Conference on Computer Design: VLSI in Computers and Processors, 1998. ICCD '98, October 5-7, 1998, pages 90 - 95.

If you would like to build a 21264, here's where to start. These papers are mainly of interest to logic/circuit designers.

The Alpha 21264: a 500 MHz out-of-order execution microprocessor, Leibholz, D. and Razdan, R.. Proceedings of IEEE Compcon '97., February 23-26, 1997, pages 28 - 36.

A 600 MHz superscalar RISC microprocessor with out-of-order execution, Gieseke, B.A., Allmon, R.L., Bailey, D.W., Benschneider, B.J., Britton, S.M., Clouser, J.D., Fair, H.R., III, Farrell, J.A., Gowan, M.K., Houghton, C.L., Keller, J.B., Lee, T.H., Leibholz, D.L., Lowell, S.C., Matson, M.D., Matthew, R.J., Peng, V., Quinn, M.D., Priore, D.A., Smith, M.J., and Wilcox, K.E., Digest of Technical Papers: IEEE International Solid-State Circuits Conference, 1997. 43rd ISSCC., 1997, pages 176 - 177, 451.

High-performance microprocessor design, Gronowski, P.E., Bowhill, W.J., Preston, R.P., Gowan, M.K., and Allmon, R.L., IEEE Journal of Solid-State Circuits, May 1998, pages 676 - 686.

Designing High Performance Microprocessors, Gronowski, P., Digest of Technical Papers: Symposium on VLSI Circuits, June 1997, pages 51 - 54.

Design tradeoffs in stall-control circuits for 600 MHz instruction queues, Fischer, T. and Leibholz, D., Digest of Technical Papers. 1998 IEEE International, Solid-State Circuits Conference, 1998, February 5-7, 1998, pages 232 - 233, 442.

Issue logic for a 600-MHz out-of-order execution microprocessor, Farrell, J.A. and Fischer, T.C., IEEE Journal of Solid-State Circuits, May 1998, Volume: 33 Issue: 5, pages 707 - 712.

Timing verification of the 21264: A 600 MHz full-custom microprocessor, Shriver, E.J., Hall, D.H., Nassif, N., Rahman, N.E., Rethman, N.L., Watt, G., and Farrell, J.A., Proceedings of the International Conference on Computer Design: VLSI in Computers and Processors, 1998, ICCD'98, October 1998, pages 96 - 103.

Circuit implementation of a 600 MHz superscalar RISC microprocessor, Matson, M., Bailey, D., Bell, S., Biro, L., Butler, S., Clouser, J., Farrell, J., Gowan, M., Priore, D., Wilcox, K., Proceedings of the International Conference on Computer Design: VLSI in Computers and Processors, 1998, ICCD'98, October 1998, pages 104 - 110.

A 600-MHz superscalar floating-point processor, Clouser, J., Matson, M., Badeau, R., Dupcak, R., Samudrala, S., Allmon, R. and Fairbanks, N. IEEE Journal of Solid-State Circuits, July 1999, Volume 34, Number 7, pages 1026 - 1029.

A 600 MHz CMOS PLL microprocessor clock generator with a 1.2 GHz VCO, von Kaenel, V., Aebischer, D., van Dongen, R., and Piguet, C., Digest of Technical Papers. 1998 IEEE International Solid-State Circuits Conference, February 5-7, 1998, pages 396 - 397.

Clocking design and analysis for a 600 MHz Alpha microprocessor, Fair, H. and Bailey, D., Digest of Technical Papers. 1998 IEEE International Solid-State Circuits Conference, February 5-7, 1998, pages 398 - 399.

Additional tips and information about attaining higher performance on Alpha may be found in the resources below.

Performance tips for Alpha Linux C programmers, Compaq Computer Corporation, December 2000.

Compaq Computer Corporation, Compiler Writer's Guide for the Alpha 21264, Order number EC-RJ66A-TE, June 1999.

Compaq Computer Corporation, Tru64 UNIX Programmer's Guide, August 1999.

Appendix A: List of performance tips

Here is a quick summary of the tips in Performance tips for Alpha Linux C programmers.

Tip #1 Use the Compaq C compiler. It has been specially tuned for Alpha.

Tip #2 Identify the Alpha implementation in your machine. Tell the compiler to produce code for that particular implementation.

Tip #3 Use hand-tuned library functions wherever possible.

Tip #4 Ask the compiler to unroll loops.

Tip #5 Ask the compiler to generate procedures in line.

Tip #6 Use C constructs that encourage the compiler to generate conditional move instructions instead of branches.

Tip #7 Use local variables to help the compiler analyze and optimize code.

Tip #8 Consider data cache size when sizing and stepping through arrays.

Tip #9 Use aligned data structures and data access.

Tip #10 Detect, identify and remove performance robbing alignment faults.

Tip #11 Help the compiler to keep data items as close to the processor as possible.

Tip #12 Help the compiler exploit software pipelining by keeping inner loop code straightforward and by keeping dependencies simple.

Tip #13 Design and code for good temporal and spatial locality.

Tip #14 Avoid array sizes and strides that are multiples of the cache size.

Tip #15 Use data structures and code organization that are "virtual memory ready."

Appendix B: Tru64-specific directions

There are four key elements in DCPI: the DCPI driver (pcount), the DCPI daemon (dcpid), the DPCI database and the analysis programs.

The driver and the DCPI daemon work together to collect information about a running application. The driver is a module in the Tru64 kernel that samples the hardware program counters. This information is, in turn, passed to the DCPI daemon. As the volume of collected data grows, the DCPI daemon writes it to the DCPI database. The DCPI database is the link between the DCPI daemon and the analysis programs. After collection is completed, the analysis programs read data from the database, analyze it, and prepare results.

The DCPI daemon must be run by the superuser. It is possible to run the DCPI daemon using setuid-root. However, this method is not secure and it is not recommended.

The DCPI installation script is interactive. If so directed, the installation script will create a database for profile data, will generate an initial scan map, and will put the scan map into the database directory. The user must specify the path to the database directory. The initial scan map file, default_scanmap, is a list of program and library images. When the installation script creates the scan map, it traverses the file system looking for program and library images to add to the list. Usually, this process takes several minutes, but it may take longer to complete if the file system has a large number of application packages installed. The scan map can be recreated or refreshed using the dcpiscan program. The scan map file can be copied to new or different DCPI collection databases. The scan map helps the DCPI daemon to identify binary images.

This paper assumes that a database directory was created during installation. In the examples, the database directory is located at /dsk0h/dcpidb. We defined the environment variable, DCPIDB, to the path to the database directory. This makes it easier to specify the database directory path to the DCPI daemon, dcpid. The programs dcpiprof and dcpilist take the database directory path from the DCPIDB variable.

After installion, it's time to collect data. Set the value of DCPIDB to the directory path.

35 > setenv DCPIDB /dsk0h/dcpidb

Next, start the DCPI daemon, specifying the profile event data to collect and the directory path.

36 > dcpid -event cycles $DCPIDB

The -event option specifies the kind of profile information to collect. The dcpid command starts the daemon running in the background and leaves the shell "live" for more commands. Now it's time to run a CPU intensive application like dilate, which performs morphological dilation on an image:

37 > dilate heart.ima ball out.ima

After the program runs, run dcpiquit to write out any remaining performance data in the driver/daemon buffers and stop the collection process.

38 > dcpiquit

The dcpiflush command tells the daemon to flush data without stopping collection. dcpiquit ends the current measurement epoch (period.) The dcpiepoch command closes the current epoch and starts a new epoch without stopping the daemon and ending collection. This allows more flexibility during measurement. The analysis commands all provide several ways to select a particular epoch (and associated set of profile data) for analysis. See the man page for

dcpiepoch
for more information.



Disclaimer: The execution times cited in this technical note are
solely intended to compare the relative merits of certain coding
practices, optimizations, etc. Your mileage may vary.





Copyright © 2001 Hewlett Packard Company