The HP Continuous Profiling Infrastructure (DCPI) is a powerful tool for the analysis of dynamic program behavior. With DCPI, a developer can collect and analyze program profile data to find performance bottlenecks.
This paper describes a collection of performance measures to detect and isolate performance issues. We illustrate the use of these measures by analyzing a small example, a program to perform integer factorization using the Elliptic Curve Method.
DCPI uses sampling to collect data about programs as they execute. This data is provided by special registers in the Alpha processor hardware. The DCPI data collection subsystem periodically reads this data from those registers and stores it in a database. The DCPI database is stored in a directory, which contains a number of subdirectories, and of course, files containing the program profile data. Sample data is organized by epoch (the time period in which data was collected), process ID and executable image. DCPI collects data about programs that run during an epoch giving the developer a system-wide view of execution. Once profile data has been collected, it can be analyzed and displayed using any of several analytical tools.
Alpha 21064, 21164 and 21264 processors implement a set of "aggregate" performance counters that count hardware events like D-cache misses, branch mispredictions, etc. Using a random sampling period, the data collection subsystem interrupts the running program and writes the current program counter value and hardware counter values to the database. This method of sampling is called "PC sampling."
Alpha 21264A processors (and later) use a different method called "instruction sampling." PC sampling on out-of-order execution engines like the Alpha 21264 smears and skews sample data and profile information cannot be precisely attributed to specific instructions. Instruction sampling solves this problem by periodically selecting a specific instruction and collecting data about it as it flows through the processor pipeline. The program counter is known precisely as well as the execution history of the instruction. The problems of smear and skew are eliminated. Like PC sampling, the sampling period is randomized to get a statistically meaningful estimate of program behavior.
The Alpha implementation of instruction sampling is called "ProfileMe" and it is the focus of this paper.
Here is a very brief example of DCPI in action. Collection and analysis takes seven steps:

1. Create a directory to hold the DCPI database (mkdir)
2. Set an environment variable, DCPIDB, to remember the path to the DCPI database (setenv)
3. Become the superuser (su)
4. Start the DCPI data collection subsystem (dcpid)
5. Run the application program
6. Stop the DCPI data collection subsystem (dcpiquit)
7. Analyze the profile data (dcpiprof, dcpilist, dcpitopcounts, etc.)

The session transcript below shows these seven steps. The command lines are numbered, e.g., "3 >", and the numbers correspond to the steps outlined above.
    1 > mkdir /dsk0h/dcpidb
    2 > setenv DCPIDB /dsk0h/dcpidb
    3 > su
    Password:
    4 > dcpid $DCPIDB
    dcpid: load : loaded pcount driver
    dcpid: monitoring pm
    dcpid: monitoring cycles
    dcpid: logging to /dsk0h/dcpidb/dcpid-positron.log
    5 > ecm4c.test 1000000 63844855 < big_number
    GMP-ECM 4c, by P. Zimmermann (Inria), 16 Dec 1999, with contributions
    from T. Granlund, P. Leyland, C. Curry, A. Stuebinger, G. Woltman,
    JC. Meyrignac, A. Yamasaki, and the invaluable help from
    P.L. Montgomery.
    Input number is 44959025334433976986064813184161514864529598931996810690621976
    170435025988493693912396407377545697917020929743416462709862460259766349010994
    4575251386017 (153 digits)
    Using B1=1000000, B2=234709020, polynomial x^12, sigma=63844855
    Step 1 took 24909ms for 12982265 muls, 3 gcdexts
    Step 2 took 22611ms for 5626744 muls, 17038 gcdexts
    ********** Factor found in step 2: 241421225374647262615077397
    Found probable prime factor of 27 digits: 241421225374647262615077397
    Probable prime cofactor 186226481390212213142337234455933956750339187108843670
    8374700394762064021217072743463856958990845558484946068708307156081498461
    has 127 digits
    6 > dcpiquit
    dcpictl: quit successful
    7 > dcpiprof -i -pm retired -event cycles

The dcpid command starts the DCPI collection subsystem, which collects processor cycles and ProfileMe events. The application program is called ecm4c.test and it performs integer factorization. The dcpiquit command stops the DCPI data collection subsystem. At this point in the session, DCPI has created a database and we're now ready to start analysis. We'll take up the discussion with step 7 in the next section.
This section continues the example above and demonstrates the use of two tools, dcpiprof and dcpilist. We'll be using these two programs throughout the paper.
The DCPI analytical tools can produce four levels of summary information. The levels are: system, image, procedure and instruction.
We ended the example above with the DCPI command:

    7 > dcpiprof -i -pm retired -event cycles

This command produces a system level summary and an image-by-image summary of retired instructions and processor cycles:
    20 > dcpiprof -i -event cycles -pm retired
    Column          Total  Period (for events)
    ------          -----  ------
    cycles         248126  126976
    retired:count  517438  126976

    The numbers given below are the number of samples for each listed
    event type or, for the ratio of two event types, the ratio of the
    number of samples for the two event types.
    ===========================================================
                           retired
     cycles      %   cum%   :count      %  image
     193926 78.16% 78.16%   427465 82.61%  ecm4c.test
      27351 11.02% 89.18%    24005  4.64%  /usr/shlib/libc.so
      26322 10.61% 99.79%    65532 12.66%  /vmunix
        151  0.06% 99.85%       44  0.01%  unknown@positron
        126  0.05% 99.90%      108  0.02%  /usr/bin/dcpid
         67  0.03% 99.93%       19  0.00%  /usr/shlib/libpthread.so
         55  0.02% 99.95%       59  0.01%  /sbin/loader
         30  0.01% 99.96%       18  0.00%  /usr/shlib/libXt.so
         15  0.01% 99.97%       10  0.00%  /usr/shlib/libX11.so
         11  0.00% 99.97%        2  0.00%  /usr/shlib/X11/libdix.so
         11  0.00% 99.98%        1  0.00%  /usr/shlib/X11/libos.so
          9  0.00% 99.98%        1  0.00%  /usr/dt/bin/dtterm
          8  0.00% 99.98%        6  0.00%  /usr/shlib/libmach.so
        ...
          0  0.00% 100.00%     145  0.03%  /dsk0h/dcpidb/PALcode

In all of our examples, we will use ellipses (...) to denote additional, irrelevant output that has been left out.
The -i option directs dcpiprof to produce a system-level summary and an image-by-image breakdown of cycle and instruction retire data.
The system-level summary is shown in the table at the top of the output:
    Column          Total  Period (for events)
    ------          -----  ------
    cycles         248126  126976
    retired:count  517438  126976

During the collection epoch, 248,126 cycle samples and 517,438 retired instruction samples were counted. Each sample accounts for 126,976 events (on average.)
The image-level summary shows the number of cycle and retired instruction samples per executable image. Notice that the breakdown shows every image in the system that was active during the data collection epoch. Our program, ecm4c.test, heads the list as it is extremely compute-bound. The entry /vmunix is the Tru64 operating system kernel. The images ending with .so are shared object libraries. The image /dsk0h/dcpidb/PALcode is the Privileged Architecture Library, which is the bridge between user code and the operating system. Finally, the program /usr/bin/dcpid is the DCPI data collection daemon. Notice that collection overhead is relatively low, a good property for a profiling system!
The image-level summary helps identify candidates for further analysis. In this case, our program image, ecm4c.test, is the best candidate to examine further. A deeper look takes us to the procedure level. The dcpiprof tool also produces a procedure-by-procedure breakdown of a particular executable image. We used the command:

    dcpiprof -pm retired -event cycles ecm4c.test

to produce the following breakdown for ecm4c.test:
    21 > dcpiprof -event cycles -pm retired ecm4c.test
    Column          Total  Period (for events)
    ------          -----  ------
    cycles         193926  126976
    retired:count  427465  126976

    The numbers given below are the number of samples for each listed
    event type or, for the ratio of two event types, the ratio of the
    number of samples for the two event types.
    ===========================================================
                           retired
     cycles      %   cum%   :count      %  procedure            image
      73297 37.80% 37.80%   200731 46.96%  __gmpn_addmul_1      ecm4c.test
      15166  7.82% 45.62%    27936  6.54%  mpz_mod_n            ecm4c.test
      11816  6.09% 51.71%    12357  2.89%  __gmpn_sb_divrem_mn  ecm4c.test
       9365  4.83% 56.54%    23107  5.41%  __gmpn_sub_n         ecm4c.test
       8996  4.64% 61.18%    22006  5.15%  __gmpn_add_n         ecm4c.test
       8450  4.36% 65.54%    14815  3.47%  __gmpn_mul_basecase  ecm4c.test
       7563  3.90% 69.44%    16466  3.85%  __gmpz_mul           ecm4c.test
       7270  3.75% 73.18%    13795  3.23%  __gmpz_sub           ecm4c.test
       6322  3.26% 76.44%    12117  2.83%  __gmpn_mul_1         ecm4c.test
       6282  3.24% 79.68%    12756  2.98%  __gmpz_add           ecm4c.test
       4770  2.46% 82.14%     9657  2.26%  __gmpn_submul_1      ecm4c.test
       4594  2.37% 84.51%     9315  2.18%  __gmpn_sqr_basecase  ecm4c.test
        ...

The summary table on top shows that 193,926 cycle samples and 427,465 retired instruction samples were attributed to the procedures in ecm4c.test.
The hottest procedure, __gmpn_addmul_1, had 73,297 cycle samples and 200,731 retired instruction samples, accounting for 37.80% and 46.96% of the cycle samples and retired instruction samples in ecm4c.test. Therefore, the procedure __gmpn_addmul_1 may be a good candidate for even deeper analysis.
We used the command:

    dcpilist -pm retired -event cycles __gmpn_addmul_1 ecm4c.test > addmul

to produce an instruction-level analysis. Since __gmpn_addmul_1 is rather lengthy, we redirected the dcpilist output into a file. When we looked at the file with a text editor, we noticed that __gmpn_addmul_1 had three separate regions of code that were hot. Here is the dcpilist output for the first hot region of __gmpn_addmul_1:
    __gmpn_addmul_1 (tmp-addmul_1.s):
     retired
      :count  cycles  freq
         ...
        1558    1748  1586  0x120010fa8 :  lda   sp, -240(sp)
        1604       2  1586  0x120010fac    stq   s0, 8(sp)
        1616    1726  1586  0x120010fb0    stq   s1, 16(sp)
        1602       2  1586  0x120010fb4    stq   s2, 24(sp)
        1600       4  1586  0x120010fb8    stq   s3, 32(sp)
        1597       0  1586  0x120010fbc    stq   s4, 40(sp)
        1562    1698  1586  0x120010fc0    stq   s5, 48(sp)
        1583       0  1586  0x120010fc4    stq   s6, 56(sp)
        1587       0  1586  0x120010fc8    and   a2, 0x7, a4
        1553       0  1586  0x120010fcc    srl   a2, 0x3, a2
        1669    1761  1586  0x120010fd0    bis   zero, zero, v0
        1652       0  1586  0x120010fd4    beq   a4, 0x12001109c
         ...

The instruction-level summary shows the distribution of samples for retire and cycle events across the individual instructions of __gmpn_addmul_1. DCPI analyzes the control flow through procedures and computes the relative frequency of the control paths. The freq column is the result of DCPI's analysis and it estimates the number of times an instruction was executed (excluding speculative attempts to execute the instruction.)
The instruction-level summary is presented as an annotated disassembly
of the procedure under study. If source code is available, the
source and disassembly can be interspersed. See the man page for
dcpilist
for further information on this capability.
Without source, an engineer must be prepared to dig into and
understand the annotated assembly code.
Sections 1.2 and 1.3 demonstrated the basics of DCPI data collection
and analysis. Section 1.3 showed how to use dcpiprof
and dcpilist
to find candidates for analysis and how
to drill down to the instruction level.
The rest of the paper is organized in the following way. Section 2
describes the most common performance bottlenecks that may occur
when executing a program on Alpha 21264. Section 3 introduces a set
of performance measures that can help identify the bottlenecks
described in Section 2. Next, Section 4 applies the performance
measures to the example program ecm4c.test
.
Finally, Section 5 describes the performance measures in more detail.
The Alpha 21264 is a superscalar, pipelined processor. It achieves high throughput by identifying and exploiting instruction level parallelism (ILP.) ILP is exploited through pipelining (successive stages of parallel instruction execution), multiple functional units (four integer and two floating point), and concurrent computation while memory operations complete (through pending load and store queues and other memory system components.) Best performance is achieved when instructions and data flow through the processor without stalling the pipeline.
The 21264 is opportunistic. It issues instructions for execution as soon as they are data-ready, even if the instructions execute in a different order than they appear in the program. This is called "out-of-order" execution. Of course, the processor cannot and does not arbitrarily shuffle instructions. The Alpha 21264 enforces rules that guarantee program correctness. Out-of-order execution maximizes the use of functional units and the memory system.
The memory system maintains a queue of load instructions and a queue of store instructions. These queues let the pipeline continue program execution while memory operations proceed to completion.
There are three main kinds of performance problems that may arise during program execution: control flow (conditional branch and jump) mispredictions, memory system replay traps, and memory system delays due to poor locality.
The Alpha 21264 fetches instructions aggressively to keep its pipeline busy. The processor fetches ahead of the current architectural state of the program in order to produce a sufficiently steady stream of instructions into the pipeline. In straight-line code, the processor just needs to fetch ahead of the program counter as it increments along through memory. But, all programs have conditional branches and jumps that change the flow of control. To keep the instructions coming, the Alpha 21264 speculates about the likely outcome of conditional branches and jumps, and predicts the new program counter value and continues to fetch from there.
Fetched instructions flow into the pipeline stages for execution. In fact, an instruction on a speculated program path may begin execution before the actual branch condition is known! Once the condition is known, and if the prediction was correct, the instruction is known to be on the good path and execution continues until the instruction completes and retires. (When an instruction retires, the processor has committed the result of the instruction to the architectural state of the machine; program execution has "officially" advanced.) If the prediction was incorrect, the instruction is on the bad path, and any intermediate computation must be discarded. The processor must "back up" to the conditional branch (or jump) and begin fetching down the good path. The pipeline flush and restart operation is costly as well as the wasteful execution of bad path instructions.
Clearly, the instruction stream, and performance, depend upon the ability of the processor to correctly predict conditional branches and jumps. If the vast majority of conditional branches and jumps are predicted correctly, the flow of instructions into the pipeline will be fast and regular.
There are a few things that a programmer can do to help with branch and jump prediction. For example, a simple conditional assignment such as A = (cond ? B : C) does not need a branch at all. The compiler can detect these special cases and generate conditional move instructions instead of a conditional branch.
As mentioned earlier, the Alpha 21264 uses out-of-order execution to exploit ILP. The processor maintains program correctness by preserving the producer-consumer relationships between instructions. The 21264 uses a technique called "register renaming" to analyze and preserve these relationships when instructions pass values between each other in registers. The register numbers (names) are right there in the instructions and the processor can perform on-the-fly dependency analysis on the instructions.
Data dependency between load and store instructions cannot be detected through on-the-fly analysis, however. Data dependency cannot be detected until effective addresses have been computed, compared, and found to refer to the same memory location. Thus, memory instructions must be issued and begin to execute deep in the pipeline before data dependency can be detected.
Here is an example. In the program excerpt below, the load instruction uses the result produced by the store instruction:

    ...
    stq   t0, 10(a0)
    ...
    ldq   v0, 10(a0)
    ...

The value is passed through a common memory location. Due to out-of-order instruction issue, the consumer (LDQ) may issue and begin to execute before the producer (STQ.) When the store instruction issues and begins execution, the memory system detects that the load and store refer to the same memory location (i.e., the effective addresses are the same.) It also detects that an incorrect result will be produced if the load is allowed to complete, since the correct value has not yet been written to memory by the store instruction. To correct this situation, the pipeline traps and replays the load instruction from the beginning. This trap and recovery mechanism is called a "replay trap," and replay traps are expensive.
There are six main kinds of Alpha 21264 memory system replay traps:
Most replay traps can be avoided through careful programming practice. Here are some suggestions.
Data layout in memory is a critical, performance-oriented programming consideration and will become even more important with non-uniform memory access (NUMA) machines. Physical memory is organized into a hierarchy of levels where each level is an optimal trade-off between memory size and speed. Each level can be characterized by its size and latency, which is the number of processor cycles needed to read a value from that level into a CPU register and use it (load-use.) Here are capacity and load-use latency values for the XP-1000 workstation:
Level | Capacity | Latency |
Level 1 (L1) cache | 64Kbytes | 3 cycles |
Level 2 (L2) cache | 4Mbytes | 26 cycles |
Primary memory | 256Mbytes+ | 130 cycles |
Backing store | Gbytes and up | Millions of cycles |
The L1 caches communicate directly with the central processor and are the fastest in order to keep up with the CPU. Backing store, which holds non-resident virtual memory pages, is the biggest and slowest. NUMA machines subdivide primary memory into local and remote memory -- each with a different capacity and latency. Remote memory may be 1,000 to 2,000 processor cycles away. A memory request moves through the levels of the hierarchy (L1 -> L2 -> Primary memory -> Backing store) until the data item is found. The data item and some of its nearest neighbors are brought forward in the hierarchy under the assumption that the item and its neighbors will be used in the very near future.
Effective use of the memory hierarchy depends upon locality of reference. Locality is both spatial and temporal. Spatial locality refers to memory items that are near each other in memory. Temporal locality refers to memory items that are recently used. A program with good locality tends to use and reuse items within a small neighborhood of memory. Programs with poor locality jump between distant memory addresses and never reuse recently referenced data items.
Memory levels may also be characterized by the kinds of addressing needed to access a data item. On the XP-1000 workstation, the L1 caches use virtual addresses and the L2 cache and primary memory use physical addresses. Primary memory is subdivided into physical memory pages and page tables define the mapping from virtual to physical addresses. The page tables also describe where non-memory resident pages are in the backing store. The processor handles virtual to physical address translation in hardware because this mapping must be performed fast. The processor uses special memory tables called "translation look-aside buffers" (TLBs) to hold the most recently used virtual to physical address translation information. The TLBs have a finite capacity and must be updated whenever a page is used and that page cannot be found in the TLB. The update is not as costly as a page fault (i.e., the page is on backing store), but there is a non-trivial penalty nonetheless.
A symptom of poor locality is a long load-use latency. If the processor cannot find data in one of the closer, faster memories, then the memory request must go out to farther, slower memory units.
Here are some suggestions for achieving good locality and memory system performance.
The Alpha 21264 has a number of resources that are allocated dynamically. Each resource is allocated from a finite pool, and if the pool is empty, then the processor must react, stall, or recover in some way. Some of these resources require an understanding of arcane machine features (e.g., miss address file, victim buffer, I/O write buffer, etc.) and we will not discuss them here.
Two important resources that we can all relate to are the physical register pool and the functional units. Up to 80 instructions may be in flight at any one time in the Alpha 21264 pipeline. The pipeline uses two (integer and floating point) large pools of physical registers and register renaming to keep many instructions in flight. The processor tracks instructions and their register use with a tag-like entity known as an "Inum." Although it occurs infrequently, computationally dense code may run out of physical registers and Inums. This condition results in a "map stall" that persists until instructions in the "back end" of the pipeline retire and release physical registers and Inums.
The integer functional units have four 64-bit adders, four logic units, two barrel shifters with byte logic, two sets of conditional branch logic, one pipelined multiplier, and one pipelined unit to perform count extension (CIX) and multimedia extension (MVI) instructions. The floating point execution unit has one pipelined multiplier, one pipelined adder, one non-pipelined divide unit (associated with the FP adder pipeline) and one non-pipelined square root unit (associated with the FP adder pipeline.) These limitations affect the number of instructions of a particular type that can be executed concurrently in a single processor cycle.
We have not emphasized performance issues related to these resources since there is not really much that a C or FORTRAN programmer can do to affect register mapping and functional unit scheduling. The compiler makes decisions about register allocation and instruction scheduling and the programmer can affect these decisions indirectly, at best, through optimization switches and rearranging code at the source level. The Alpha 21264 then makes its own dynamic decisions through register renaming and instruction issue rules.
This section is a brief introduction to performance measures that can be applied to a program to assess its performance. Section 5 describes these measures in more detail including some important conditions to consider when applying them.
The table below is a list summarizing the measures that we will discuss. Some of these measures can be used to assess the program as a whole and generally indicate the presence of performance problems. Other measures deal with specific performance areas: control flow misprediction, memory system replay traps, and memory system delays.
Measure | Indication | Goal |
cycles | Cycles (time) spent | Decrease |
retired | Retired instructions | Decrease |
retired/cycles | Instructions retired per cycle | Increase |
retired/aborted | Useful work vs. wasted effort | Increase |
retired/trap | Useful work per trap | Increase |
retired/replay trap | Useful work per replay trap | Increase |
retired/mispredict | Useful work per mispredict | Increase |
cond br mispredict/retired | Conditional branch misprediction rate | Decrease |
jsr mispredict/retired | JSR group misprediction rate | Decrease |
ldstorder | Load-store order replay trap | Decrease |
replays | Memory system replay trap | Decrease |
retired/DTB miss | Useful work per DTB miss | Increase |
retired/ITB miss | Useful work per ITB miss | Increase |
nyp/retired | Lower bound I-cache miss rate | Decrease |
retire delay/valid instructions | Average retire delay | Decrease |
Seven measures can be used to assess the program as a whole and to perform a little high-level "triage." (See the next table below.) In our earlier example, we used processor cycle samples (cycles) and the number of retired instructions (retired) to identify hot spots in the application. In addition to this role, cycles and retired instructions provide a basis for comparison through ratios. Although one can compare the raw number of replay traps from different experimental runs, ratios offer a few advantages over raw, absolute event counts.
Measure | Indication | Goal |
cycles | Cycles (time) spent | Decrease |
retired | Number of instructions retired | Decrease |
retired/cycle | Instructions retired per cycle | Increase |
retired/aborted | Useful work vs. wasted | Increase |
retired/trap | Useful work per trap | Increase |
retired/replay trap | Useful work per replay trap | Increase |
retired/mispredict | Useful work per mispredict | Increase |
Retired instructions per cycle (and its reciprocal) is a rough measure of concurrency, i.e., how well the machine is exploiting ILP. If the processor is retiring, on average, more than one instruction per cycle, then concurrency is being realized. Architectural research shows that 3.3 retires per cycle can be achieved in integer code with loads that hit in the D-cache, but this is unlikely to be sustained. A retire/cycle ratio that approaches 3 is very good for integer code. Similar analysis for floating point shows a maximum of 2.3 retires per cycle for short periods. A retire/cycle ratio approaching 2 is very good for floating point code.
The four ratios:

    retired / aborted,
    retired / trap,
    retired / replay trap, and
    retired / mispredict

provide a quick way to assess the effect of aborted (discarded or killed) instructions, traps of all kinds, memory system replay traps specifically, and control flow mispredictions (conditional branches and jumps.) The ratio of retired instructions to aborted instructions is a rough measure of efficiency, that is, how much useful work is done relative to work that is discarded. A better measure of efficiency would be the number of cycles lost due to aborted instructions (via traps.) Unfortunately, due to limitations of the performance measurement hardware in the Alpha 21264A, we cannot quantify this notion of efficiency. It's hard to specify a particular retire/abort threshold. Two retires per abort is clearly bad. The threshold depends upon one's tolerance of inefficiency. It is probably impossible to remove all traps, especially in code with memory traffic.
The ratio of retired instructions per trap and the ratio of retired instructions per memory system replay trap indicate the frequency of all traps and memory system replay traps, respectively. Fifty retires per trap indicates poor performance and the cause of the traps should be investigated.
The ratio of retired instructions per mispredict indicates the frequency of conditional branch and JSR group (jump) mispredicts. As the analysis in Section 5 shows, a high prediction rate is needed to produce a steady stream of good path instructions. As a rough guideline, a programmer should strive for a minimum of twenty retired instructions per misprediction. Of course, even higher ratios are desirable. Good prediction will bring down the number of aborted instructions since more fetched and issued instructions will be on the good path and they will successfully complete and retire.
The four TLB trap types can skew the number of retired instructions, sometimes making system-level statistics look better than they actually are. This bias should be removed from system-level measures. The four TLB trap types are:

- single DTB miss (event: dtbmiss, PALcode entry: dtbm_single)
- double DTB miss, 3-level page table (event: dtb2miss3, PALcode entry: dtbm_double_3)
- double DTB miss, 4-level page table (event: dtb2miss4, PALcode entry: dtbm_double_4)
- ITB miss (event: itbmiss, PALcode entry: itb_miss)
These traps are handled by PALcode routines, so their retired instructions appear under the PALcode image in the dcpiprof image-by-image summary:

                           retired
     cycles      %   cum%   :count      %  image
     193926 78.16% 78.16%   427465 82.61%  ecm4c.test
      27351 11.02% 89.18%    24005  4.64%  /usr/shlib/libc.so
      26322 10.61% 99.79%    65532 12.66%  /vmunix
        ...
          0  0.00% 100.00%     145  0.03%  /dsk0h/dcpidb/PALcode
        ...

In our example, only 145 retired instruction samples were collected for the PALcode image. (The PALcode image is recreated in the DCPI database for convenient access.) When a substantial number of DTB miss or other TLB miss events are recorded, the number of PALcode retires may be quite high. In this case, use the dcpiprof command:

    dcpiprof -pm retired /dsk0h/dcpidb/PALcode

to produce a procedure-by-procedure breakdown of retired instruction samples attributed to the PALcode image. Here is a short fragment of the breakdown taken from the example:

     retired
      :count      %   cum%  procedure            image
          25 17.24% 17.24%  swpipl_cont          /dsk0h/dcpidb/PALcode
          21 14.48% 31.72%  dtbm_single          /dsk0h/dcpidb/PALcode
          17 11.72% 43.45%  subr_6c08            /dsk0h/dcpidb/PALcode
          15 10.34% 53.79%  swpipl               /dsk0h/dcpidb/PALcode
          11  7.59% 61.38%  interrupt            /dsk0h/dcpidb/PALcode
          10  6.90% 68.28%  interrupt_subr_6400  /dsk0h/dcpidb/PALcode
           8  5.52% 73.79%  rti_cont             /dsk0h/dcpidb/PALcode
           5  3.45% 77.24%  dtbm_double_3_cont   /dsk0h/dcpidb/PALcode
           5  3.45% 80.69%  callsys_cont         /dsk0h/dcpidb/PALcode
           4  2.76% 83.45%  swpctx_cont          /dsk0h/dcpidb/PALcode
         ...

The DTB miss (single) routine had only 21 retired instruction samples. Thus, in our example, we can ignore the effect of DTB misses on retires. If the number of retired instruction samples is high, more than one or two percent, then the retired instruction samples for DTB miss should be subtracted from the number of system-level retired instruction samples. This will prevent DTB misses from skewing the system-level estimate of retired instructions.
One other component can skew system-level measures -- the operating system idle loop. The idle loop burns cycles, has good ILP and rarely traps. Thus, it can make system-level statistics look too good. Use the command:
    dcpiprof -pm retired /vmunix | more

to obtain a procedure-by-procedure breakdown of the Tru64 operating system kernel. Here is an excerpt from the example:

     retired
      :count      %   cum%  procedure        image
       64651 98.66% 98.66%  idle_thread      /vmunix
         104  0.16% 98.81%  hardclock        /vmunix
          47  0.07% 98.89%  simple_lock      /vmunix
          41  0.06% 98.95%  ufs_sync_int     /vmunix
          40  0.06% 99.01%  table            /vmunix
          37  0.06% 99.07%  malloc_internal  /vmunix
         ...

The procedure named idle_thread is the idle loop. In this case, 64,651 retired instruction samples have been attributed to the idle loop out of 65,532 total for the image /vmunix. The system-level measure of retired instructions should be reduced by the number of retired instruction samples attributed to the idle loop.
The measures in this section can be applied at the image, procedure and instruction levels as well as the overall system level. Don't hesitate to apply these measures at finer levels of granularity, too.
Although we have referred to "cycles," "retired instructions," etc. in the last several paragraphs, the terms "cycle samples" and "retired instruction samples" are more precise and correct. Throughout this paper, the word "samples" is usually implied, if not explicitly stated. That is, after all, what we are combining in the numeric formulas.
As mentioned in Section 2, the Alpha 21264 predicts the outcome of conditional branches and jumps in order to fetch instructions beyond them. More precisely, "jumps" are the JSR group of Alpha instructions. The JSR group includes: JMP, JSR, RET, and JSR_COROUTINE.
The Alpha 21264 uses two different mechanisms to predict the target of conditional branches and JSR group instructions.
Measure | Indication | Goal |
cond br mispredict/retired | Conditional branch misprediction rate | Decrease |
jsr mispredict/retired | JSR group misprediction rate | Decrease |
The ratio of conditional branch mispredictions to retired instructions is the misprediction rate for a conditional branch. The ratio of JSR group mispredictions to retired instructions is the misprediction rate for a JSR group instruction.
The best way to apply these measures is to find individual conditional
branch, return (RET), or subroutine call (JSR) instructions
with a high number of misprediction samples. Follow the procedure
illustrated in the example in Section 1.3 and use dcpiprof
to drill down to frequently mispredicted instructions. Ideally,
individual misprediction rates should be as low as possible (5% or below.)
The ratio of retired instructions to replay traps is usually enough to identify performance loss due to memory system replay traps. This section suggests some DCPI/ProfileMe event types that can be used to analyze traps. These events are most meaningful at the instruction level, rather than as aggregate measures at the system, image or procedure levels.
Measure | Indication | Goal |
ldstorder | Load-store order replay trap | Decrease |
replays | Memory system replay trap | Decrease |
The ldstorder
event indicates that an instruction caused
a load-store order replay trap. This attribution is precise. If the
ldstorder
event is not asserted and the instruction
has replay trap events (replays
) attributed to it, then the
replay trap was one of the following:
The ratio of retired instructions to ITB misses and the ratio of retired instructions to DTB misses indicate the frequency of specific TLB misses. DTB misses are more likely to be a problem than ITB misses, especially in programs using large arrays or a big random-access heap.
It is difficult to set a problem threshold for TLB misses; however, 50-100 retired instructions per DTB miss, for example, would be unacceptably high. Let's take the simple case of a program that walks sequentially through a very large array of long integers (8 bytes per integer.) Each DTB entry supports access to an 8,192 byte page, which is equivalent to 1,024 long integers. Assume that the loop is very tight and consists of four instructions:
(4 instructions · 1024 iterations) / 1 DTB miss = 4,096 retired instructions per DTB miss

This kind of analysis can help you to find your comfort level for a particular program.
Measure | Indication | Goal |
retired/DTB miss | Useful work per DTB miss | Increase |
retired/ITB miss | Useful work per ITB miss | Increase |
nyp/retired | Lower bound I-cache miss rate | Decrease |
retire delay/valid instructions | Average retire delay | Decrease |
The ProfileMe event nyp
is asserted when a profiled
instruction is contained in an aligned 4-instruction I-cache fetch
block that requested a new I-cache fill. The event name is an
acronym for "not yet prefetched," which is a pretty apt description.
The I-cache fill is in response
to an I-cache miss. Unfortunately, the performance
counting hardware is unable to count all I-cache misses. For example,
the nyp
bit is not set if the prefetcher asked for the instruction
a mere cycle before the fetcher needed it. Thus, the number of
nyp
samples is only a lower bound on the total number
of I-cache misses.
With that limitation in mind, the ratio of nyp
samples
to retired instruction samples is an optimistic approximation of
the I-cache miss rate. The true rate is somewhat higher. Unfortunately,
there is no way to estimate the error between the approximation
and the true rate.
Retire delay is a lower bound on the number of processor cycles that
an instruction has delayed its own retirement and the retirement of
the other instructions that follow it in program order. As a measure,
it is most effective at the instruction level. A large retire delay
may indicate the presence of a bottleneck. The ProfileMe retire
delay (retdelay
) is defined only for instructions that
retire without causing a trap (i.e., instructions that are
valid
.)
The Alpha 21264A does not include all cycles in its retire delay count. Thus, the total cycle count is almost always larger than the retire delay count. (Ideally, they would be the same.) Cycles are not counted while:
The difference, cycles - retire delay, is an estimate of how many cycles are lost to I-cache misses, pipeline flushes and pipeline restarts.
Average retire delay is the total instruction retire delay divided by the number of valid retired instructions (i.e., the number of instructions that retired without trapping.) Average retire delay can be used to identify and analyze long-latency load operations in the following way:
Here's a simple demonstration of the method in action. Assuming that
the procedure function_mn
is an important, hot procedure,
we used dcpilist
to display the retire delay and
retired instruction samples for its instructions. A quick scan of
the annotated disassembly showed a subtract instruction
(subq
) with a large retire delay:
1 > dcpilist -pm valid:retdelay+retired function_mn example_prof | more
function_mn (example.c):
...
     valid    valid
 :retdelay   :count   freq
         2      112    105   0x12001d180   ldq   t9, 144(sp)
     18677      100    105   0x12001d184   subq  t9, t8, t5
...

We then reran
dcpilist
, this time displaying the
average retire delay for each instruction:
2 > dcpilist -pm valid:retdelay::valid function_mn example_prog | more
function_mn (example.c):
     valid
 :retdelay
        ::
     valid
    :count
...
    0.0179   0x12001d180   ldq   t9, 144(sp)
  186.7700   0x12001d184   subq  t9, t8, t5
...

The average retire delay of the subtract instruction easily exceeds our working threshold of 20 cycles. The subtract consumes a value produced by a load instruction (
ldq
) and the value is
passed in register t9
. The load, therefore, is probably
the object of L1 D-cache misses.
In this case, it wasn't hard to find the producing load instruction because the load (the producer) was immediately before the subtract (the consumer.) Detective work may be a little more difficult in real code, although the producer will still pass the value to the consumer in a common register. Since the Alpha 21264A may have up to 80 instructions in flight at once, you may need to look further back in the code, including instructions in a previous iteration of a loop or back through a branch into a different basic block.
Why does the retire delay accumulate on the consuming instruction instead of the load? When the processor begins execution of the load, it evaluates the effective address and puts the load instruction into the load queue where it awaits action by the memory subsystem. If the load does not trap, it is allowed to retire. The processor continues to issue and execute instructions, eventually coming to the consumer. The processor must delay execution of the consuming instruction until the value from the load has been read by the memory subsystem and is available. If the memory system takes a long time to produce the value (i.e., if the value is in the L2 cache or worse, primary memory), then the consuming instruction will be delayed a long time and its retire delay will be high.
Out-of-order execution may hide part of the D-cache miss penalty. Useful work can and sometimes is performed between the load instruction and its consumer. Sometimes one of the intervening instructions suffers a long stall due to an I-cache miss, etc. This time will not be reflected in the retire delay of the consuming instruction.
In this section, we'll show how to apply the basic performance measures.
We will use an example program, ecm4c
, that performs
integer factorization using the Elliptic Curve Method. The program
uses the GNU Multiple Precision (MP) Arithmetic Library to perform
arithmetic on arbitrarily large integers.
The default make
procedure for the GNU MP library uses
these C compiler flags on an Alpha 21264A-based platform:
-g -arch ev6 -tune ev6

The -arch and -tune options tell the compiler to produce and tune code for a 21264-based target platform. The -arch switch actually implies -tune and is sufficient by itself.
Use of -g
is more problematic from the performance tuning
point of view. The -g
option controls the level of symbolic
debug information to be generated into the resulting object file/image.
The -g
switch is equivalent to the -g2
option,
which produces full symbolic information for debugging. However,
-g
suppresses optimization (i.e., -g
or -g2
implies the optimization switch -O0
.)
This default build switch will not provide the best performance.
We recommend changing the debug flag to -g3
, which
produces symbol table information and optimized code. Due to
compiler optimizations, some correlation is lost between the source
code and the executable program. We chose to change the individual
Makefiles, but the correct place to make this change is in
configure.in
, since the Makefiles are generated from
configure.in
by autoconf and automake.
To emphasize the point, when ecm4c
is built with the
-g
switch, it executes in about 46.6 seconds. When
it is built with -g3
(optimizations enabled),
ecm4c
executes in 41.8 seconds, a savings of nearly
five seconds.
Here are a few tips on the specification of DCPI/ProfileMe
events. The tools dcpiprof
and dcpilist
accept command line options that specify cycle and ProfileMe events.
Processor cycles are counted using an aggregate counter and cycle
samples are selected with the option:

-event cycles

ProfileMe events are selected with the option:

-pm XXXXX

where "XXXXX" may be any ProfileMe event
or combination of ProfileMe events and ratios. Individual ProfileMe event
types are retired
, valid
, nyp
,
cbrmispredict
, ldstorder
, and so forth.
See man dcpiprofileme
for a complete list. Samples for
instructions that did not take a specific ProfileMe event
may be selected with an option of the type:

-pm /YYYYY

where the "/" indicates "NOT" and "YYYYY" may be any ProfileMe event. For example, the option:

-pm /retired

selects samples for instructions that did not retire.
The ProfileMe event syntax also permits three kinds of combinations of event specifiers:

The "+" combination, which selects samples for several events at once, displaying one column per event.
The "^" combination, which selects samples for instructions on which both events occurred.
The "::" combination, which displays the ratio of the sample counts for two events.

For example, the option:

-pm retired+/cbrmispredict^mispredict::retired

displays the number of retired instruction samples in one column and, in a second column, the ratio of instructions that suffered mispredicts other than conditional branch mispredicts to the number of retired instructions.
The table below is another list of the performance measures. The third column contains the corresponding simple DCPI command line option for each given measure.
Measure | Indication | Option |
cycles | Cycles (time) spent | -event cycles |
retired | Retired instructions | -pm retired |
retired/cycles | Cycles needed to retire instruction | -pm retired -event cycles |
retired/aborted | Useful work vs. wasted effort | -pm retired::/retired |
retired/trap | Useful work per trap | -pm retired::trap |
retired/replay trap | Useful work per replay trap | -pm retired::replays |
retired/mispredict | Useful work per mispredict | -pm retired::mispredict |
cond br mispredict/retired | Conditional branch misprediction rate | -pm cbrmispredict::retired |
jsr mispredict/retired | JSR group misprediction rate | -pm /cbrmispredict^mispredict::retired |
ldstorder | Load-store order replay trap | -pm ldstorder |
replays | Memory system replay trap | -pm replays |
retired/DTB miss | Useful work per DTB miss | -pm retired::dtbmiss |
retired/ITB miss | Useful work per ITB miss | -pm retired::itbmiss |
nyp/retired | Lower bound I-cache miss rate | -pm nyp::retired |
retire delay/valid instructions | Average retire delay | -pm valid:retdelay::valid |
Command lines and output lines can get pretty long. Here are a few helpful suggestions.
The event name retired can be abbreviated as "ret", and cbrmispredict can be abbreviated as "cbr".
Put long dcpiprof and dcpilist commands into shell scripts and invoke the scripts instead.
The first step in the system-level assessment is to obtain a
summary of cycle and retired instruction samples. The summary may be
obtained with the dcpiprof
command:
dcpiprof -i -pm retired -event cycles > ret+cyc

Here, we redirected the summary to a file ("ret+cyc") so we can refer to the data again later, if necessary. This command produces the following output:

Column          Total     Period (for events)
------          -----     ------
cycles         248126     126976
retired:count  517438     126976

The numbers given below are the number of samples for each
listed event type or, for the ratio of two event types, the
ratio of the number of samples for the two event types.
===========================================================
                          retired
 cycles      %    cum%    :count       %   image
 193926  78.16%  78.16%   427465  82.61%   ecm4c.test
  27351  11.02%  89.18%    24005   4.64%   /usr/shlib/libc.so
  26322  10.61%  99.79%    65532  12.66%   /vmunix
    151   0.06%  99.85%       44   0.01%   unknown@positron
    126   0.05%  99.90%      108   0.02%   /usr/bin/dcpid
     67   0.03%  99.93%       19   0.00%   /usr/shlib/libpthread.so
     55   0.02%  99.95%       59   0.01%   /sbin/loader
     30   0.01%  99.96%       18   0.00%   /usr/shlib/libXt.so
     15   0.01%  99.97%       10   0.00%   /usr/shlib/libX11.so
     11   0.00%  99.97%        2   0.00%   /usr/shlib/X11/libdix.so
     11   0.00%  99.98%        1   0.00%   /usr/shlib/X11/libos.so
...

This output shows a system-wide summary at the top and an image-by-image breakdown of cycle samples and retired instruction samples. In the interest of space, we truncated the table, leaving off images that contributed the least to cycles and retires.
Overall cycle samples and retired instruction samples may have been
skewed by the kernel idle loop and by PALcode TLB miss handling.
In Section 3.1, we saw that TLB miss handling was inconsequential.
The contribution of the idle loop, however, is substantial. To
check idle loop processing, we used the following dcpiprof
command to show a procedure-by-procedure breakdown of kernel processing:
dcpiprof -event cycles -pm retired /vmunix > triage.vmunix

                         retired
 cycles      %    cum%   :count       %   procedure        image
  23838  90.56%  90.56%    64651  98.66%   idle_thread      /vmunix
    254   0.96%  91.53%       41   0.06%   ufs_sync_int     /vmunix
    223   0.85%  92.38%        9   0.01%   _XentInt         /vmunix
    153   0.58%  92.96%       47   0.07%   simple_lock      /vmunix
    142   0.54%  93.50%      104   0.16%   hardclock        /vmunix
     54   0.21%  93.70%       37   0.06%   malloc_internal  /vmunix
     54   0.21%  93.91%       12   0.02%   pmap_zero_page   /vmunix
     51   0.19%  94.10%       40   0.06%   table            /vmunix
...

Idle thread processing accounts for 90.56% of the cycle samples attributed to the operating system kernel, "/vmunix". We need to subtract idle thread cycle samples and retired instruction samples from the system-wide totals to remove the idle loop bias:
cycles         248,126 - 23,838 = 224,288 unbiased cycle samples
retired:count  517,438 - 64,651 = 452,787 unbiased retire samples

We will use the unbiased values to compute the other system-level measures. The overall ratio of retired instructions per cycle is 452,787 / 224,288, or roughly 2.02 retired instructions per cycle. This is not too bad for the system as a whole (with a target of 3 retires/cycle), but perhaps there is some room for improvement.
The next set of measures to compute are the four ratios that indicate
potential areas for investigation (Section 3.1.) The following
dcpiprof
command gives us the raw data that we need:
dcpiprof -i -pm retired+/retired+trap+replays+mispredict > triage.1

This command displays the following system-level summary table at the top of its output:

Column             Total     Period (for events)
------             -----     ------
retired:count     517438     126976
!retired:count     30651     126976
!notrap:count       1724     126976
replays:count         98     126976
mispredict:count    1624     126976

The next step is to compute the four system-level ratios, using the unbiased count of retired instruction samples:

retired / aborted      452,787 / 30,651 =    14.77
retired / trap         452,787 /  1,724 =   262.64
retired / replays      452,787 /     98 = 4,620.27
retired / mispredict   452,787 /  1,624 =   278.81

It's interesting to compare these unbiased ratios against the idle loop-biased ratios:

dcpiprof -i -pm ret+ret::/ret+ret::trap+ret::replays+ret::mispredict > triage.2

Column                                 Total     Period (for events)
------                                 -----     ------
retired:count                         517438     126976
retired:count::!retired:count      16.881603     NA
retired:count::!notrap:count      300.138051     NA
retired:count::replays:count     5279.979592     NA
retired:count::mispredict:count   318.619458     NA

Notice that the biased ratios paint a rosier picture of performance than the unbiased ratios do.
At this point in the analysis, no major problem is indicated. It may be that statistics for too many different sub-components have been aggregated and problems are hidden. In the next section, we will push the analysis down a level and will take a look at image-level measures.
At the image level, we can get statistics about each of the individual
active images in the system. The dcpiprof
command:
dcpiprof -i -pm ret+ret::/ret+ret::trap+ret::replays+ret::mispredict > triage.3

produces a table showing the key overall assessment ratios for each of the active images. Here is an excerpt from that table:

retired   retired:count  retired:count  retired:count  retired:count
 :count   ::!retired     ::!notrap      ::replays      ::mispredict
          :count         :count         :count         :count         image
 427465    16.2004        320.1985      4857.5568       342.7947      ecm4c.test
  65532   152.7552       2184.4000      9361.7143      2978.7273      /vmunix
  24005     6.6978         69.7820             NA        69.7820      /usr/shlib/libc.so
    145     1.4948         20.7143             NA        20.7143      /dsk0h/dcpidb/PALcode
    108     1.6615         27.0000        36.0000       108.0000      /usr/bin/dcpid
...

The excerpt contains the most active images. The five images, from top to bottom, are the factorization program (
ecm4c.test
),
the operating system kernel (/vmunix
), the C runtime
library (libc.so
), the PALcode library (PALcode
),
and the DCPI data collection daemon (dcpid
). We will
concentrate on ecm4c.test
because it has the most cycles
and retired instructions. However, the retire/abort ratio for
the C library is quite low and would be worth investigating on its own.
The low ratio of retired instructions to control flow mispredicts
suggests that mispredicts are a problem in the active part of the library
code. The retire/cycle ratio for the C library is 24,005/27,423 or
roughly 0.88, suggesting that much improvement could be made. The
retire/cycle ratio for ecm4c.test
is 427,465/193,926 or
roughly 2.20.
In the case of ecm4c.test
, the high retire/replay ratio
rules out replay traps as a likely cause of lost performance.
Control flow mispredictions, again, are worth investigating at the procedure and instruction levels. We should look for long-latency load instructions as well.
In this section we will look for the best candidate procedures for instruction-level analysis.
The dcpiprof
command:
dcpiprof -pm retired -event cycles ecm4c.test > triage.4

produces a table giving the cycle samples and retired instruction samples attributed to each procedure in ecm4c.test. Here are the top procedures in the table:

                         retired
 cycles      %    cum%   :count       %   procedure            image
  73297  37.80%  37.80%   200731  46.96%  __gmpn_addmul_1      ecm4c.test
  15166   7.82%  45.62%    27936   6.54%  mpz_mod_n            ecm4c.test
  11816   6.09%  51.71%    12357   2.89%  __gmpn_sb_divrem_mn  ecm4c.test
   9365   4.83%  56.54%    23107   5.41%  __gmpn_sub_n         ecm4c.test
   8996   4.64%  61.18%    22006   5.15%  __gmpn_add_n         ecm4c.test
   8450   4.36%  65.54%    14815   3.47%  __gmpn_mul_basecase  ecm4c.test
   7563   3.90%  69.44%    16466   3.85%  __gmpz_mul           ecm4c.test
   7270   3.75%  73.18%    13795   3.23%  __gmpz_sub           ecm4c.test
   6322   3.26%  76.44%    12117   2.83%  __gmpn_mul_1         ecm4c.test
   6282   3.24%  79.68%    12756   2.98%  __gmpz_add           ecm4c.test
   4770   2.46%  82.14%     9657   2.26%  __gmpn_submul_1      ecm4c.test
   4594   2.37%  84.51%     9315   2.18%  __gmpn_sqr_basecase  ecm4c.test
...

We edited this table and computed the retire/cycle ratio for each of the procedures to find possible troublemakers. Here is the table:

                         retired                                retired::
 cycles      %    cum%   :count       %   procedure            cycles
  73297  37.80%  37.80%   200731  46.96%  __gmpn_addmul_1      2.74
  15166   7.82%  45.62%    27936   6.54%  mpz_mod_n            1.84
  11816   6.09%  51.71%    12357   2.89%  __gmpn_sb_divrem_mn  1.04  <--
   9365   4.83%  56.54%    23107   5.41%  __gmpn_sub_n         2.46
   8996   4.64%  61.18%    22006   5.15%  __gmpn_add_n         2.45
   8450   4.36%  65.54%    14815   3.47%  __gmpn_mul_basecase  1.75
   7563   3.90%  69.44%    16466   3.85%  __gmpz_mul           2.18
   7270   3.75%  73.18%    13795   3.23%  __gmpz_sub           1.90
   6322   3.26%  76.44%    12117   2.83%  __gmpn_mul_1         1.92
   6282   3.24%  79.68%    12756   2.98%  __gmpz_add           2.03
   4770   2.46%  82.14%     9657   2.26%  __gmpn_submul_1      2.02
   4594   2.37%  84.51%     9315   2.18%  __gmpn_sqr_basecase  2.03

The retire/cycle ratio of the top procedure,
__gmpn_addmul_1
,
is quite good at 2.74. This procedure is written in Alpha assembler code
and was hand-tuned for the Alpha 21264 (EV6) processor.
The procedure, __gmpn_sb_divrem_mn
, has a retire/cycle
ratio of only 1.04. This definitely needs investigation!
Before shifting focus to the implementation of
__gmpn_sb_divrem_mn
, we evaluated the overall assessment
ratios for each of the procedures in ecm4c.test
using the
command:
dcpiprof -pm ret+ret::/ret+ret::trap+ret::replays+ret::mispredict ecm4c.test

Edited output for the most active procedures is:

retired   retired:count  retired:count  retired:count  retired:count
 :count   ::!retired     ::!notrap      ::replays      ::mispredict
          :count         :count         :count         :count         procedure
 200731    79.7818       2995.9851     22303.4444      3460.8793     __gmpn_addmul_1
  27936    15.4856        410.8235             NA       410.8235     mpz_mod_n
  23107    48.2401        700.2121      11553.5000      745.3871     __gmpn_sub_n
  22006    81.8067        431.4902      22006.0000      440.1200     __gmpn_add_n
  16466    10.7832        191.4651       2744.3333      205.8250     __gmpz_mul
  14815    18.8726        493.8333       7407.5000      529.1071     __gmpn_mul_basecase
  13795     6.7293        130.1415        766.3889      156.7614     __gmpz_sub
  12756     5.6021        103.7073        850.4000      118.1111     __gmpz_add
  12357     2.9027         50.8519       1123.3636       53.2629     __gmpn_sb_divrem_mn  <--
  12117    15.5946        390.8710      12117.0000      403.9000     __gmpn_mul_1
   9657     6.8979        482.8500             NA       482.8500     __gmpn_submul_1
   9315    18.9329       4657.5000             NA      4657.5000     __gmpn_sqr_basecase

The procedure
__gmpn_sb_divrem_mn
is retiring only three
instructions for each instruction that it discards. The low
retire to mispredict ratio suggests that we should
investigate mispredicts within the procedure
__gmpn_sb_divrem_mn
. Because this procedure accounts
for just 6% of the cycles in ecm4c.test
, we will need
to substantially improve its performance in order to improve overall
program performance.
Let's take a look at the conditional branch mispredictions in the
procedure __gmpn_sb_divrem_mn
to see if we can improve
the prediction rate. We used the dcpilist
command:
dcpilist -both -event cycles -pm cbr::ret __gmpn_sb_divrem_mn ecm4c.test > sb

to display an annotated listing of the procedure
__gmpn_sb_divrem_mn
. The listing is quite long
so output was redirected to a file (sb
.) This command
produces both a source and assembler listing of the procedure.
The procedure contains a number of conditional branch instructions with poor
misprediction rates.
cbrmispredict
       :count
           ::
      retired
 cycles  :count  freq
...
   6651  0.0234   107  udiv_qrnnd (q, r1, nx, workaround, dx);
...
      0  0.0000   105  0x12001b8bc  mulq    v0, s4, t2
    120  0.0000   105  0x12001b8c0  ldq     s4, 104(sp)
      0  0.0000   105  0x12001b8c4  srl     t6, 0x20, t7
      0  0.0000   105  0x12001b8c8  mulq    v0, s4, t3
      0  0.0000   105  0x12001b8cc  subq    s2, t2, t2
    107  0.0000   105  0x12001b8d0  mulq    t2, t4, t2
      0  0.0000   105  0x12001b8d4  bis     t2, t7, t2
      0  0.0000   105  0x12001b8d8  cmpult  t2, t3, a4
      0  0.4414   105  0x12001b8dc  beq     a4, 0x12001b900
    618  0.0000    56  0x12001b8e0  addq    t2, s3, t2
      0  0.0000    56  0x12001b8e4  lda     t1, -1(v0)
      0  0.0000    56  0x12001b8e8  cmpult  t2, s3, a5
      0  0.0000    56  0x12001b8ec  cmpult  t2, t3, t8
     65  0.0000    56  0x12001b8f0  bic     t8, a5, a5
      0  0.1429    56  0x12001b8f4  blbc    a5, 0x12001b900
    116  0.0000     7  0x12001b8f8  lda     t1, -2(v0)
      0  0.0000     7  0x12001b8fc  addq    t2, s3, t2
   1456  0.0000   105  0x12001b900  ldq     t12, -31368(gp)
      0  0.0000   105  0x12001b904  subq    t2, t3, t2
      0  0.0000   105  0x12001b908  ldq     a1, 112(sp)
      0  0.0000   105  0x12001b90c  zapnot  t6, 0xf, t6
...

The conditional branches are associated with the statement:

udiv_qrnnd (q, r1, nx, workaround, dx);

in the source file
gmp-3.1.1/mpn/sb_divrem_mn.c
in the
GNU MP library. The code results from the expansion of a rather large
macro, udiv_qrnnd
, which is defined in the source file
gmp-3.1.1/longlong.h
. This required a little bit of
detective work to find and relate the assembler code back to the macro.
The branch at address 0x12001b8dc has a poor prediction rate (about 56%.)
Examination of the macro udiv_qrnnd
relates this conditional
branch to the following statements:
#define __udiv_qrnnd_c(q, r, n1, n0, d) \
  do { \
    UWtype __d1, __d0, __q1, __q0, __r1, __r0, __m; \
    ...
    if (__r1 < __m) \
      { \
        __q1--, __r1 += (d); \
        if ((__r1 >= (d)) && (__r1 < __m)) \
          __q1--, __r1 += (d); \
      } \
    __r1 -= __m; \
    ...

The conditional branch at 0x12001b8dc is generated from the first
if
test and the conditional branch at 0x12001b8f4 is
generated from the second if
test. Notice that in the
second case, the compiler has evaluated the &&
as a
bitwise AND operation instead of emitting a branch to implement the
logical AND operation. (If optimization is disabled, the compiler
will emit another branch.) C programmers can use this
optimization explicitly to remove unnecessary branches, for example,
by replacing the conditional expression:
((__r1 >= (d)) && (__r1 < __m))

with the bitwise logical expression:

((__r1 >= (d)) & (__r1 < __m))

Programmers must be careful to preserve correctness, however, as some conditional expressions depend upon an "early out," as in the case of testing for a NULL pointer before using the pointer elsewhere in the same conditional expression.
The macro code suggests that one or both branches could be eliminated by encouraging the compiler to generate conditional move instructions instead of conditional branches. An Alpha conditional move instruction such as:
cmoveq RA, RB, RC

transfers the contents of register
RB
to RC
if the register RA
is equal to zero. We rewrote the macro
code in the following way:
#define __udiv_qrnnd_c(q, r, n1, n0, d) \
  do { \
    UWtype __d1, __d0, __q1, __q0, __r1, __r0, __m, __t1; int __p1, __p2;\
    ...
    __p1 = (__r1 < __m); \
    __q1 -= __p1; \
    __t1 = __r1 + (d); \
    __r1 = (__p1 ? __t1 : __r1); \
    __p2 = __p1 & (__r1 >= (d)) & (__r1 < __m); \
    __q1 -= __p2; \
    __t1 = __r1 + (d); \
    __r1 = (__p2 ? __t1 : __r1); \
    __r1 -= __m; \
    ...

The rewrite adds three new variables: the temporary
__t1
and the two predicate flags __p1
and __p2
. The
new code uses the flags to conditionally increment the variable
__q1
and to guard the transfer of new values into the
variable __r1
. The compiler should generate conditional
move instructions to implement the C language conditional assignments.
The flag __p1
will be the value of the first if
test. The flag __p2
is the logical AND of __p1
and the value of the second if
test. Combining
conditions eliminates both if
tests, resulting
in "straight line" code without branches.
The rewritten code was compiled, executed and measured with DCPI.
The predicate-based macro code compiles into the dcpilist
excerpt below:
retired
 :count  cycles  freq
...
   2976    6424   107  udiv_qrnnd (q, r1, nx, workaround, dx);
...
      0     103   106  0x12001b918  mulq    v0, a1, t2
      0     100   106  0x12001b91c  mulq    v0, s4, t3
    107     122   106  0x12001b920  srl     t5, 0x20, t7
      0      83   106  0x12001b924  zapnot  t5, 0xf, t6
      0     103   106  0x12001b928  subq    s2, t2, t2
      0     124   106  0x12001b92c  mulq    t2, t4, t2
    108      91   106  0x12001b930  bis     t2, t7, t2
      0     130   106  0x12001b934  cmpult  t2, t3, a4
      0     102   106  0x12001b938  addq    t2, s3, a5
      0     238   106  0x12001b93c  cmoveq  a4, t2, a5
    191     117   106  0x12001b940  subq    v0, a4, t1
      0      86   106  0x12001b944  cmpule  s3, a5, t8
      0      95   106  0x12001b948  cmpult  a5, t3, t9
      0     107   106  0x12001b94c  addq    a5, s3, t10
    156      99   106  0x12001b950  and     a4, t8, t8
      0      91   106  0x12001b954  and     t8, t9, t8
      0     216   106  0x12001b958  cmoveq  t8, a5, t10
      0     120   106  0x12001b95c  subq    t1, t8, t1

The conditional branch instructions have been removed and the overall cycle count for this region of code is lower.
Unfortunately, in this case, the conditional move approach does
not provide the gain that we had hoped for. The table below compares
cycle and retire samples for the two versions of ecm4c
as a whole.
Version | Cycles | Retired |
Baseline | 193,926 | 427,465 |
Predicate | 192,339 | 428,635 |
The predicate version retires more instructions because the processor
maps each conditional move into two internal instructions and thus, each
conditional move appears to be counted twice. Although the number of
cycle samples is slightly lower for the predicate version, it is not
enough to substantially improve execution time. In fact, any improvement
in time is covered by the measurement error and no discernible
improvement in execution time was noted. This is not to say that the
technique is ineffective, but simply that it did not yield an improvement
in this particular case. The overall share of cycles attributed to
__gmpn_sb_divrem_mn
decreased by only a small amount,
from roughly 6.09% to 5.24%.
Conditional move instructions can also be manually inserted into
the source code using the asm
built-in function.
Source modules must include /usr/include/c_asm.h in order to use
in order to use
asm
function calls. The include file defines asm
as an intrinsic so that the compiler will perform special processing
on any call to asm
.
The first parameter to asm
is a sequence of Alpha
instructions. The instructions follow the syntax illustrated in
c_asm.h
. Further documentation can be found in the
Tru64 UNIX Programmer's Guide. Register usage follows
the Alpha call convention: registers %a0
through
%a5
contain incoming arguments and
register %v0
contains the return value. The function
asm
returns an integer value. Single
and double precision versions, called fasm
and dasm
, are also available.
Here is a version of the macro __udiv_qrnnd_c
that uses asm
to insert conditional moves into
the code stream:
__p1 = (__r1 < __m); \
__q1 -= __p1; \
__t1 = __r1 + (d); \
__r1 = asm("bis %r31, %a0, %v0;" \
           "cmovne %a1, %a2, %v0;", __r1, __p1, __t1); \
__p2 = __p1 & (__r1 >= (d)) & (__r1 < __m); \
__q1 -= __p2; \
__t1 = __r1 + (d); \
__r1 = asm("bis %r31, %a0, %v0;" \
           "cmovne %a1, %a2, %v0;", __r1, __p2, __t1); \
__r1 -= __m; \

The assembler instructions replace the C language conditional assignment statements that we used in the predicated version of the macro. The
bis
instructions above are actually
move operations that initialize the register %v0
to the old value of __r1
. The cmovne
instructions test the predicate flag, and if the flag is non-zero,
move the value of __t1
to the
result register %v0
.
If optimizations are enabled, the compiler will expand the
asm
calls in-line and will merge the instructions
into the generated code stream. This means that registers
other than %a0
, %a1
, etc. will
convey arguments and the result. The substitutions result in
tighter, faster code. If optimizations are disabled, possibly through
the use of the -g
debug switch, then code to
set up %a0
, %a1
, etc. will be
generated, introducing additional runtime overhead. The extra
overhead may then nullify any expected gain in performance.
In this section, we will demonstrate a quick way to find instructions with high average retire delay. High average retire delay may indicate the presence of a long latency load operation.
The fastest way to identify the instructions with the largest average
retire delay is to use the dcpitopcounts
command. The
following command:
dcpitopcounts -n 20 -pm valid:retdelay::valid ecm4c.test

will display the top 20 instructions in
ecm4c.test
with
the highest average retire delay. The command produces:
  retired
:retdelay
       ::
  retired
   :count  procedure                 pc           instruction
    15.00  lucas_cost                120007030    addt   $f0,$f1,$f0
    14.00  prac                      120007348    addt   $f0,$f10,$f0
    12.00  prac                      120007350    stt    $f0, 88(sp)
     9.00  prac                      12000763c    br     zero, 0x12000791c
     9.00  __gmpn_gcdext             120014014    bne    t9, 0x120014128
     7.80  __gmpz_set                12000fd74    cmplt  t0, a1, t0
     7.17  __gmpz_tdiv_r             12001567c    cvtqt  $f1,$f1
     7.00  lucas_cost                1200070d0    srl    t1, 0x21, t1
     7.00  lucas_cost                120007178    srl    t1, 0x21, t1
     7.00  prac                      12000753c    srl    t4, 0x22, t4
     7.00  __gmpn_sb_divrem_mn       12001b934    bis    t2, t6, t2
     6.57  __gmpn_sb_divrem_mn       12001b8d4    bis    t2, t7, t2
     6.20  __gmpn_sb_divrem_mn       12001b8cc    subq   s2, t2, t2
     6.01  __gmpn_sb_divrem_mn       12001b92c    subq   t2, t9, t2
     6.00  __gmpz_add                12000c6b0    bsr    ra, 0x12000c5a0
     6.00  __gmpn_divmod_1_internal  120011c28    ldbu   a3, 0(a3)
     6.00  __gmpn_gcdext             12001497c    stq    t1, 160(s6)
     6.00  __gmpn_tdiv_qr            120019f58    ldq    t4, -8(s3)
     5.67  __gmpn_gcdext             120014cd0    subq   s0, a4, s1
     5.54  __gmpn_mul_1              120015df0    stq    t2, 0(a0)

We should note that none of these instructions has a large enough average retire delay to suggest a problem with data cache behavior. (Twenty cycles is a reasonable threshold for an Alpha 21264A-based XP-1000 workstation.) However, we will investigate and discuss possible causes for higher average retire delay.
The next step in the analytical process is to examine the context of
each instruction to see if it consumes a value produced by a load
instruction. This can be accomplished by
using dcpilist
to disassemble
and display each instruction, using either the more
command
or a text editor to browse through the disassembled code. Three candidate
instructions are in the procedure lucas_cost
,
which can be disassembled and annotated with the following command:
    dcpilist -event cycles -pm valid:retdelay::valid lucas_cost ecm4c.test > lucas

The first instruction is at address 0x120007030:

    lucas_cost:
    cycles   valid:retdelay
    :count   ::valid          freq
    ...
    0         4.6667          6     0x12000702c  divt $f0,$f17,$f0
    7        15.0000          6     0x120007030  addt $f0,$f1,$f0

It is a floating point add instruction that consumes the value produced by the double precision divide instruction that immediately precedes it in program order. Floating point divide is a long latency operation (12 cycles single precision, 15 cycles double precision), so it's not too surprising that the consumer was delayed by 15 cycles.
The next instruction with high average retire delay in lucas_cost is at address 0x1200070d0:

    0   0.9091  10  0x1200070c4  mulq   t0, t1, t1
    0   0.0000  10  0x1200070c8  zapnot t3, 0xf, t3
    0   1.0625  10  0x1200070cc  mulq   t0, t3, t3
    9   7.0000  10  0x1200070d0  srl    t1, 0x21, t1

This instruction is a left shift that consumes the output of the integer multiply instruction that precedes it at address 0x1200070c4. Integer multiply has a result latency of 7 cycles. It is coincidental that the average retire delay equals the latency of integer multiply; some of the latency could have been "hidden" by the instructions between the first multiply and the consuming shift instruction -- instructions that could have been executed while the first multiply was in progress.

The third instruction in lucas_cost with high average retire delay is at address 0x120007178:

    0   NA      1   0x120007174  mulq t0, t1, t1
    0   7.0000  1   0x120007178  srl  t1, 0x21, t1

and it also consumes the output of an integer multiply instruction.
Most of the remaining instructions in the list can be eliminated for one of two reasons:

- The instruction had only 1 or 2 retire samples in the dcpitopcounts list -- not enough to yield a statistically significant result.
The instruction at address 0x12000fd74 in the procedure
__gmpz_set
could potentially be the victim of a
long latency load. It consumes the result of a load instruction:
    __gmpz_set (set.c):
    retired  valid:retdelay
    :count   ::valid          freq
    ...
    2         2.0000          3     0x12000fd44  ldl   t0, 0(a0)
    ...
    5         7.8000          3     0x12000fd74  cmplt t0, a1, t0
    ...

We found the load instruction by tracing back from the instruction at 0x12000fd74 to find the instructions that produced the register operands t0 and a1. The trace-back of t0 ended with the load instruction at 0x12000fd44.
Unfortunately, neither instruction was executed enough times to
give much statistical confidence in the result.
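The trace-back described above can be mechanized. The following is a hypothetical sketch, not a DCPI feature: the listing format and the destination-register rule are simplifying assumptions, and it only works within straight-line code (the filler instruction at 0x12000fd48 is made up).

```python
# Hypothetical sketch: scan backwards through a linear disassembly listing
# for the instruction that produced a given register operand.

def dest_register(instr):
    """Return the register an instruction writes. Alpha loads (ld*) name
    their destination first; operate instructions name it last."""
    mnemonic, operands = instr.split(None, 1)
    parts = [p.strip() for p in operands.split(",")]
    return parts[0] if mnemonic.startswith("ld") else parts[-1]

def trace_back(listing, start_index, reg):
    """Scan backwards from start_index for the producer of reg."""
    for i in range(start_index - 1, -1, -1):
        if dest_register(listing[i][1]) == reg:
            return listing[i]
    return None

# Load and consumer taken from the __gmpz_set example; middle entry made up.
listing = [
    ("0x12000fd44", "ldl t0, 0(a0)"),
    ("0x12000fd48", "bis zero, zero, v0"),   # hypothetical filler
    ("0x12000fd74", "cmplt t0, a1, t0"),
]
print(trace_back(listing, 2, "t0"))  # the ldl at 0x12000fd44 produced t0
```
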
The procedure __gmpn_addmul_1
consumed the most cycles
and was clearly the hottest procedure in the ecm4c.test
program. It consists of assembler code
that was carefully tuned for maximum performance on the Alpha 21264.
The hand-tuned assembler code achieves its speed through loop unrolling
and software pipelining. The programmer carefully controlled the
assignment of instructions to functional units in order to keep them
as busy as possible and to minimize instruction stalls.
Its retire/cycle ratio is 2.64, which is near the "theoretical" target
of 3.3 retires per cycle for integer code (assuming that load operations
hit in the L1 D-cache.) We performed some experiments to improve the
code schedule and simply were not able to improve performance through
instruction scheduling.
The procedure __gmpn_addmul_1
has eight code regions
as summarized in the table below.
Region | "Temperature" | Size |
A | Hot | 2 |
B | Warm | 48 |
C | Hot | 12 |
D | Warm | 7 |
E | Cold | 42 |
F | Hot | 50 |
G | Cold | 115 |
H | Hot | 57 |
A region is a sequence of related instructions and may consist of one or more basic blocks. Region "temperature" indicates the relative execution frequency of a region. Hot regions are executed frequently, warm regions are executed infrequently, and cold regions are not executed at all.
A performance increase may sometimes be obtained through "hot-cold" procedure optimization, routine splitting and chaining. These optimizations are performed by the tool Spike. Spike rearranges procedures to place frequently called procedures close to each other in memory and rearranges code to put hot regions together. Control flow between regions is changed to permit straight-line flow from one hot region to the next, and cold regions are moved out of line. Entry points to basic blocks are padded with NOP instructions to increase fetch efficiency. Overall, these changes improve the L1 I-cache performance and instruction fetch behavior of the program.
The table below shows the order of the regions in
__gmpn_addmul_1
after Spike'ing ecm4c.test
.
The hot regions have been pulled together at the top of
the procedure followed by the warm regions. The cold regions have
been moved out of line to a separate area in memory.
Region | "Temperature" |
A | Hot |
C | Hot |
F | Hot |
H | Hot |
B | Warm |
D | Warm |
... | ... |
E | Cold |
G | Cold |
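The layout transformation can be sketched in a few lines. This is a deliberate simplification of what Spike does (real Spike also rewrites control flow and pads entry points with NOPs; none of that is modeled here); the region names and temperatures come from the tables above.

```python
# Simplified sketch of hot-cold code layout: place hot regions first, then
# warm regions, and move cold regions out of line, as in __gmpn_addmul_1.

regions = [
    ("A", "Hot"), ("B", "Warm"), ("C", "Hot"), ("D", "Warm"),
    ("E", "Cold"), ("F", "Hot"), ("G", "Cold"), ("H", "Hot"),
]

rank = {"Hot": 0, "Warm": 1, "Cold": 2}
# A stable sort keeps the original program order within each temperature class.
layout = sorted(regions, key=lambda r: rank[r[1]])
print([name for name, _ in layout])
# -> ['A', 'C', 'F', 'H', 'B', 'D', 'E', 'G']
```

The stable sort is the essential design choice: within each temperature class, the regions keep their original program order, matching the Spike'd layout shown above.
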
Using Spike reduced execution time to 41.3 seconds from 41.8 seconds,
which was the execution time for the baseline version
of ecm4c.test
.
-event cycles
The number of processor cycles is a measure of time. At the system level, it reflects the total number of processor cycles on a system-wide basis that were needed to complete a task. System level measurements may include cycles that were spent on non-relevant tasks such as unrelated transient interrupts from the network, the user interface, etc. Thus, it is important to take measurements on a quiet system with as little non-relevant activity as possible.
At the image and procedure levels, cycles indicate the amount of time spent executing code within an image or procedure, respectively. Images or procedures that take the most cycles to execute are the most lucrative targets for improvement. Executable components such as shared libraries, PALcode and the operating system may contribute to the total number of relevant cycles, just as the application program components themselves do.
Cycles at the instruction level indicate the amount of time attributed to the execution of individual instructions. On out-of-order execution machines like the Alpha 21264 or 21364, cycles cannot be attributed precisely and cycle counts indicate the neighborhood of a hot spot. Cycle samples are often attributed to the first instruction in a fetch block (also known as a 16 byte "quad pack.")
Summarizing, if you want to measure all of the cycles
spent running a workload, use cycle samples.
On Alpha 21264 and other out-of-order execution
machines, time that should be charged to one instruction may be
charged as cycle samples to an instruction that is up to 80 instructions
later in program order. Cycle samples are fine for dcpiprof
summaries, but may be confusing at the instruction level.
-pm retired
An instruction is "retired" when the instruction and all preceding instructions (in program order) are past the point at which exceptions and mispredicts are detected. Retirement commits the results of the instruction to the architectural program state and frees physical registers that are no longer needed. The Alpha 21264 can retire up to eleven instructions in a single cycle. Instructions are "in flight" from the time they are mapped to the time that they are retired.
Because a retired instruction advances the state of program execution, it is, in a sense, a "useful" instruction. It has computed a value that directly contributes to the intended result. Useful computational work has been performed and committed when an instruction retires.
At the system, image, procedure and instruction levels, the number of retired instructions indicates the proportion of successfully executed instructions for a given component. Since a retired instruction is "useful" in the sense defined above, it can be used in ratios to compare the amount of useful work completed versus the number of events that degrade performance.
The following special cases apply to instruction counts.
-event cycles -pm retired
The cycle to retired instruction and retired instruction to cycle ratios are clearly related to each other (they're reciprocals of one another.)
The cycle to retired instruction ratio is the number of cycles required to complete and commit a retired instruction. Ideally, this ratio is as small as possible since less time is required, on average, to execute a retired instruction. A ratio of less than one indicates concurrency since an instruction is effectively being completed in less than a full processor cycle.
The retired instruction to cycle ratio is the number of retired instructions that are produced in a single processor cycle. A ratio that is greater than one indicates concurrency.
The Alpha 21264 is able to retire up to eleven instructions in a single cycle. It can sustain a rate of eight retires per cycle for short periods of time. This peak figure is further limited in real code by the availability (or shortage) of physical integer and floating point registers. In integer code with D-cache references, a maximum of 3.3 retires per cycle could be achieved, but is quite unlikely in practice. In floating point code with D-cache references, the maximum is 2.3 floating point instructions per cycle, again, a maximum that is unlikely to be achieved in real code. Thus, a retire/cycle ratio of 3 in integer code and a retire/cycle ratio of 2 in floating point code are exceptionally good. If the ratio is in the neighborhood of one retire per cycle or less, then further improvement is possible and investigation is warranted.
The sampling period for cycle and retired instruction events must be equal, or the sample counts must be normalized to account for the different sampling periods. This requirement must be satisfied for all composite performance measures (e.g., ratios.) The sampling "weight" of one event in a composite measure must be the same as the "weight" of any other event in the measure.
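The normalization requirement can be made concrete. In this hypothetical sketch, raw sample counts taken with different sampling periods are scaled to a common weight before forming a retire/cycle ratio; the periods and counts are made-up numbers for illustration.

```python
# Hypothetical sketch: normalize sample counts taken with different
# sampling periods before forming a composite measure. A count taken with
# period P represents roughly count * P actual events, so scaling by the
# period puts both events on the same "weight".

def normalized(count, period):
    return count * period

cycle_samples,   cycle_period   = 120_000, 63_488    # made-up numbers
retired_samples, retired_period = 90_000,  126_976   # period is 2x cycles'

cycles  = normalized(cycle_samples, cycle_period)
retired = normalized(retired_samples, retired_period)

retires_per_cycle = retired / cycles
print(f"{retires_per_cycle:.2f} retires/cycle")
```

Forming the ratio from the raw counts (90,000 / 120,000 = 0.75) would understate the true ratio by a factor of two, because each retired-instruction sample here stands for twice as many events as each cycle sample.
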
-pm retired::/retired
Instructions that did not retire are aborted instructions. Aborted instructions did not complete and did not advance the program state. Instructions may not complete for a variety of reasons such as mispredicted control flow and memory system replay traps. Instructions caught in the shadow of a trap will not complete and are aborted.
Aborted instructions consume pipeline resources. Depending on how
far an aborted instruction has progressed, it may consume an
instruction number (Inum), a physical register for its result,
an instruction queue slot, and a functional unit.
An instruction may be aborted (killed) early in the pipeline, before
it enters an issue queue. An instruction may be aborted (killed) late
in the pipeline, after it enters an issue queue. The early_kill
and late_kill
ProfileMe event types can be used to measure
the number of early and late kills, respectively. Instructions that are
killed late in the processor pipeline are more costly than early kills.
The ratio of retired to non-retired (aborted) instructions indicates the relative number of useful, completed instructions to wasted, discarded instructions. Naturally, a high ratio is desired because that indicates efficient use of pipeline resources.
-pm retired::trap
The pipeline traps when the hardware has detected a condition that must be corrected. Two common kinds of traps are control flow mispredictions and memory system replay traps. (These will be explained a little later.) The processor must restore its internal state to a point from which recovery can be performed. This often means aborting certain inflight instructions, thereby discarding intermediate results.
The ratio of retired instructions to the number of traps indicates the relative frequency of instruction traps. Since traps are wasteful, a high ratio is desirable (i.e., relatively infrequent pipeline traps.)
A pipeline trap is not the same as a software trap. Pipeline traps correct internal processor conditions to maintain architecturally correct execution of a program. Software traps yield execution to the operating system via a much different mechanism.
-pm retired::replays
Memory system replay traps respond to dynamic conditions that
are detected in the memory system. Certain conditions cannot be
detected statically by analyzing the program stream. Sometimes,
these conditions depend upon the effective address computed by
a load or store instruction. For example, the hardware cannot
detect a producer-consumer dependency between a store instruction
and a consuming load until it evaluates both effective addresses
and realizes that they are the same address. Due to aggressive
instruction scheduling, the processor may need to change the
order of execution of the load and store to preserve correctness.
This kind of replay trap has a special name, the load-store order replay trap, and is reported by the DCPI/ProfileMe event name ldstorder.
The ratio of retired instructions to replay traps is an indicator of the relative frequency of replay traps. Since replay traps result in wasted (aborted) instructions, they should be avoided.
Be careful to exclude the operating system idle loop from system level analysis. The idle loop gets good ILP and doesn't trap much.
-pm retired::mispredict
The Alpha 21264 pipeline needs a steady incoming flow of instructions to keep its functional units and memory subsystem busy. In regions of "straight-line" code with no branches, the processor can freely fetch ahead as fast as possible to keep the flow of instructions coming. When the processor encounters a conditional branch or JSR group instruction (jump, jump to subroutine, return from subroutine, and jump to coroutine) it must make a decision as to where to continue fetching instructions. The processor uses recent program execution history to predict the location of the target and to continue fetching from there. Instructions beyond the conditional branch or JSR group instruction are fetched and enter the back end of the pipeline beginning execution.
If the processor predicts the conditional branch or jump correctly, the instructions will execute, complete and retire. If the processor predicts incorrectly, the instructions beyond the conditional branch or JSR group instruction must be discarded (aborted.) Clearly, accurate prediction is important to good performance.
At the system and image levels, the ratio of retired instructions to mispredicted instructions is a rough indicator of prediction accuracy. The mispredict and retire data at the system and image levels is aggregated across all conditional branches and jumps in the program. Exceptionally good prediction accuracy in some code/loops may mask the presence of performance degrading mispredicts in other code/loops. Thus, at the system and image levels, this ratio is not a perfect indicator, but is still somewhat useful. The ratio performs better at the procedure and instruction levels, especially if each procedure is small and is not subject to masking due to aggregation of sample data.
At the instruction level, the DCPI/ProfileMe mispredict
event is
valid for conditional branch, JSR, RET, JMP, and JSR_COROUTINE
instructions only. The ratio is the misprediction rate for the
instruction. The difference,

    1 - (mispredict / retired),

is the prediction rate or prediction accuracy for the instruction.
High prediction accuracy is needed to provide a "window" of useful instructions for execution. For a given prediction accuracy, p, and basic block size, B, the effective "window" size when fetching through two conditional branches is:

    B + p·B + p²·B

Basic block size is typically 5 to 6 instructions. The Alpha 21264 may have up to 80 instructions in flight at once. In order to achieve this, prediction accuracy must be very high in code with many conditional branches and short basic blocks. For example, 99%+ accuracy is required when fetching through 10 branches with a 6 instruction basic block size. If the basic block size doubles to 12 instructions, less accuracy is required (93%.)
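The two-branch formula extends naturally to fetching through n branches. The generalization below is our extrapolation of the formula in the text, shown as a short sketch:

```python
# Effective fetch "window" when speculating through n conditional branches,
# generalizing the two-branch formula B + p*B + p^2*B:
#   window(n) = B * (1 + p + p^2 + ... + p^n)

def window_size(p, block_size, branches):
    return block_size * sum(p ** i for i in range(branches + 1))

# Two branches, perfect prediction, 6-instruction blocks: 3 blocks visible.
print(window_size(1.0, 6, 2))            # 18 instructions

# With 12-instruction basic blocks, 93% accuracy sustains a window larger
# than 80 instructions through 10 branches ...
print(round(window_size(0.93, 12, 10)))  # ~94
# ... but with 6-instruction blocks the same accuracy falls well short.
print(round(window_size(0.93, 6, 10)))   # ~47
```
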
-pm cbrmispredict::retired
The DCPI/ProfileMe event cbrmispredict
applies to
conditional branch instructions. The ratio of conditional branch
mispredicts to retired instructions is the conditional branch
misprediction rate. An aggregate conditional branch
mispredict rate can be computed at the system, image and procedure
levels. Again, exceptionally good performance in some loops/code
may mask problems in other loops/code due to aggregation.
At the instruction level, the ratio is the misprediction rate for individual conditional branch instructions. The Alpha 21264 branch prediction algorithm is table-driven. Conditional branch mispredicts are due to:
-pm /cbrmispredict^mispredict::retired
As noted above, the DCPI/ProfileMe mispredict
event
is valid for both conditional branch instructions and JSR group
instructions. Unlike conditional branch misprediction events, a
separate event is not provided for JSR group misprediction events.
Instead, the subset of JSR group misprediction events can be
derived from all mispredictions using the
event expression:
/cbrmispredict^mispredict
This expression selects the sample data for mispredict events (mispredict) which are not conditional branch mispredict events (/cbrmispredict). The sample count for the subset is then divided by the retire count, obtaining the rate in the usual way.
Interpretation of the JSR group misprediction rate is analogous to the interpretation of the conditional branch misprediction rate.
-pm retired::dtbmiss
-pm retired::itbmiss
A translation look-aside buffer (TLB) is a small memory that caches recently used virtual to physical address mapping information. A TLB is a table where each entry describes the mapping of a virtual memory page to a physical memory page. (The entry can handle multiple contiguous pages if granularity hints are used.)
The Alpha 21264 has two TLBs: an Instruction Translation Buffer (ITB) and a Data Translation Buffer (DTB.) The ITB handles instruction fetches and the DTB handles data reads and writes. The ITB and DTB each have 128 entries. If the page size is 8,192 bytes, then each Alpha TLB has a 1Mbyte reach, that is, each TLB provides a 1Mbyte "window" into the virtual memory space.
When information for a virtual address cannot be found in the appropriate TLB, then one of the entries must be updated with information for the new target page. The Alpha PALcode is responsible for handling the TLB miss and for updating the TLB with the new page information.
The ratio of retired instructions to ITB misses and the ratio of retired instructions to DTB misses indicate the relative frequency of ITB and DTB misses, respectively. Since ITB and DTB miss handling is relatively expensive, higher ratios (infrequent ITB and DTB misses) are desirable.
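The 1 Mbyte "reach" figure follows directly from the TLB geometry described above; a quick check using the entry count and page size from the text:

```python
# TLB reach = number of entries * page size. (Granularity hints, which let
# one entry map multiple contiguous pages, are ignored here.)
entries   = 128    # Alpha 21264 ITB/DTB entries
page_size = 8192   # bytes per page

reach = entries * page_size
print(reach // (1024 * 1024), "Mbyte")  # 1 Mbyte
```
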
When computing DTB and ITB miss ratios at the system level, do not include the number of retired instructions attributed to the ITB/DTB miss handling routines in the PALcode. The number of retired instructions in the miss handling routines should be subtracted from the total number of retired instructions as miss handling instructions are computational overhead. Failure to do so will include extra retired instructions in the numerator of the ratio and will make the situation appear to be better than it is.
-pm retired::nyp
The Alpha 21264A does not provide a ProfileMe event for L1 I-cache
and D-cache misses. The DCPI/ProfileMe event nyp
is
asserted when a profiled instruction is contained in an aligned
4-instruction I-cache fetch block that requested a new I-cache fill
stream. Thus, "nyp
" means "not yet prefetched."
The counters are blind to many I-cache misses; the nyp
bit is
not set if the prefetcher asked for the instruction a mere cycle before
the fetcher needed it, for example. Therefore, nyp
undercounts I-cache miss events and is a lower bound on the
number of I-cache misses.
The ratio of the number of retired instructions per instructions not yet prefetched is a rough indicator of the frequency of L1 I-cache misses. The reciprocal ratio:

    nyp / retired

is a rough approximation of the I-cache miss rate. A low I-cache miss rate is a goal for good performance.
-pm valid:retdelay::valid
An instruction's retire delay is a lower bound on the number of CPU cycles that the instruction has delayed its own retirement and the retirement of other instructions following it in program order. High retire delay may indicate the presence of a performance bottleneck.
Ideally, the total, system-level retire delay would equal the total number of cycles, since retire delay would charge all spent cycles to instructions. Unfortunately, the Alpha 21264A has "blind spots" in its performance counters and total retire delay is less than total cycles. (During a blind spot, the hardware event counter stops incrementing and does not count events.) On Alpha 21264A, retire delay has blind spots that effectively exclude most of the cycles spent aborting instructions and refilling the pipeline after a pipeline trap. Retire delay, for example, is blind to I-cache miss stalls. Retire delay (retdelay) is not blind to load misses, however, making retire delay a useful tool in assessing load latency.
The gap between cycles and retire delay is given by the formula:

    cycles - retire delay

This difference is a good estimate of the cycles spent aborting instructions and refilling the pipeline after pipeline traps.
Retire delay is especially valuable for finding high latency load operations. The ratio:

    retire delay / number of valid instructions

gives the average retire delay. The ratio can be used at the instruction level to identify load-miss opportunities (i.e., long latency load operations.) To find load-miss opportunities, start with the command:

    dcpitopcounts -pm valid:retdelay+valid:retdelay::valid

This command will sort instructions by total retire delay; the output can then be filtered to keep only the instructions whose average retire delay exceeds a chosen threshold. Next, examine the candidates and find instructions whose register operands are produced by a preceding load instruction (a load-use pair.) That will identify instructions that are:
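The filtering step can be scripted. This hypothetical sketch applies an average-retire-delay threshold to rows shaped like dcpitopcounts output; the row format, the made-up third row, and the 20-cycle threshold (the XP-1000 rule of thumb discussed earlier) are our assumptions.

```python
# Hypothetical sketch: filter dcpitopcounts-style rows by average retire
# delay. Each row: (avg_retire_delay, procedure, pc, instruction). The
# 20-cycle threshold suits an Alpha 21264A-based XP-1000; tune per machine.

THRESHOLD = 20.0

rows = [
    (15.00, "lucas_cost",   "120007030", "addt $f0,$f1,$f0"),
    (14.00, "prac",         "120007348", "addt $f0,$f10,$f0"),
    (27.50, "hypothetical", "120009999", "addq t0, t1, t0"),  # made-up row
]

candidates = [r for r in rows if r[0] > THRESHOLD]
for delay, proc, pc, instr in candidates:
    print(f"{delay:6.2f}  {proc:<14} {pc}  {instr}")
```

Only the made-up third row survives the filter; the real rows from the ecm4c.test run all fall under the 20-cycle threshold, consistent with the earlier observation that this program shows no data-cache problem.
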