The HP Continuous Profiling Infrastructure (DCPI) is a powerful tool for the analysis of dynamic program behavior. With DCPI, a developer can collect and analyze program profile data to find performance bottlenecks.
This paper describes a collection of performance measures to detect and isolate performance issues. We illustrate the use of these measures by analyzing a small example, a program to perform integer factorization using the Elliptic Curve Method.
DCPI uses sampling to collect data about programs as they execute. This data is provided by special registers in the Alpha processor hardware. The DCPI data collection subsystem periodically reads this data from those registers and stores it in a database. The DCPI database is stored in a directory, which contains a number of subdirectories, and of course, files containing the program profile data. Sample data is organized by epoch (the time period in which data was collected), process ID and executable image. DCPI collects data about programs that run during an epoch giving the developer a system-wide view of execution. Once profile data has been collected, it can be analyzed and displayed using any of several analytical tools.
Alpha 21064, 21164 and 21264 processors implement a set of "aggregate" performance counters that count hardware events like D-cache misses, branch mispredictions, etc. Using a random sampling period, the data collection subsystem interrupts the running program and writes the current program counter value and hardware counter values to the database. This method of sampling is called "PC sampling."
Alpha 21264A processors (and later) use a different method called "instruction sampling." PC sampling on out-of-order execution engines like the Alpha 21264 smears and skews sample data and profile information cannot be precisely attributed to specific instructions. Instruction sampling solves this problem by periodically selecting a specific instruction and collecting data about it as it flows through the processor pipeline. The program counter is known precisely as well as the execution history of the instruction. The problems of smear and skew are eliminated. Like PC sampling, the sampling period is randomized to get a statistically meaningful estimate of program behavior.
The Alpha implementation of instruction sampling is called "ProfileMe" and it is the focus of this paper.
Here is a very brief example of DCPI in action. Collection and analysis takes seven steps:

1. Create a directory to hold the DCPI database (mkdir)
2. Set an environment variable, DCPIDB, to remember the path to the DCPI database (setenv)
3. Become the superuser (su)
4. Start the DCPI data collection subsystem (dcpid)
5. Run the application program
6. Stop the DCPI data collection subsystem (dcpiquit)
7. Analyze the profile data (dcpiprof, dcpilist, dcpitopcounts, etc.)

The session transcript below shows these seven steps. The command lines are numbered, e.g., "3 >", and the numbers correspond to the steps outlined above.
    1 > mkdir /dsk0h/dcpidb
    2 > setenv DCPIDB /dsk0h/dcpidb
    3 > su
    Password:
    4 > dcpid $DCPIDB
    dcpid: load : loaded pcount driver
    dcpid: monitoring pm
    dcpid: monitoring cycles
    dcpid: logging to /dsk0h/dcpidb/dcpid-positron.log
    5 > ecm4c.test 1000000 63844855 < big_number
    GMP-ECM 4c, by P. Zimmermann (Inria), 16 Dec 1999, with contributions
    from T. Granlund, P. Leyland, C. Curry, A. Stuebinger, G. Woltman,
    JC. Meyrignac, A. Yamasaki, and the invaluable help from
    P.L. Montgomery.
    Input number is 44959025334433976986064813184161514864529598931996810690621976
    170435025988493693912396407377545697917020929743416462709862460259766349010994
    4575251386017 (153 digits)
    Using B1=1000000, B2=234709020, polynomial x^12, sigma=63844855
    Step 1 took 24909ms for 12982265 muls, 3 gcdexts
    Step 2 took 22611ms for 5626744 muls, 17038 gcdexts
    ********** Factor found in step 2: 241421225374647262615077397
    Found probable prime factor of 27 digits: 241421225374647262615077397
    Probable prime cofactor 186226481390212213142337234455933956750339187108843670
    8374700394762064021217072743463856958990845558484946068708307156081498461
    has 127 digits
    6 > dcpiquit
    dcpictl: quit successful
    7 > dcpiprof -i -pm retired -event cycles

The dcpid command starts the DCPI collection subsystem, which collects processor cycles and ProfileMe events. The application program is called ecm4c.test and it performs integer factorization. The dcpiquit command stops the DCPI data collection subsystem. At this point in the session, DCPI has created a database and we're now ready to start analysis. We'll take up the discussion with step 7 in the next section.
This section continues the example above and demonstrates the use of two tools, dcpiprof and dcpilist. We'll be using these two programs throughout the paper.
The DCPI analytical tools can produce four levels of summary information. The levels are: system, image, procedure and instruction.
We ended the example above with the DCPI command:

    7 > dcpiprof -i -pm retired -event cycles

This command produces a system level summary and an image-by-image summary of retired instructions and processor cycles:
    20 > dcpiprof -i -event cycles -pm retired
    Column          Total  Period (for events)
    ------          -----  ------
    cycles         248126  126976
    retired:count  517438  126976

    The numbers given below are the number of samples for each listed
    event type or, for the ratio of two event types, the ratio of the
    number of samples for the two event types.
    ===========================================================
                           retired
     cycles      %   cum%   :count      %  image
     193926 78.16% 78.16%   427465 82.61%  ecm4c.test
      27351 11.02% 89.18%    24005  4.64%  /usr/shlib/libc.so
      26322 10.61% 99.79%    65532 12.66%  /vmunix
        151  0.06% 99.85%       44  0.01%  unknown@positron
        126  0.05% 99.90%      108  0.02%  /usr/bin/dcpid
         67  0.03% 99.93%       19  0.00%  /usr/shlib/libpthread.so
         55  0.02% 99.95%       59  0.01%  /sbin/loader
         30  0.01% 99.96%       18  0.00%  /usr/shlib/libXt.so
         15  0.01% 99.97%       10  0.00%  /usr/shlib/libX11.so
         11  0.00% 99.97%        2  0.00%  /usr/shlib/X11/libdix.so
         11  0.00% 99.98%        1  0.00%  /usr/shlib/X11/libos.so
          9  0.00% 99.98%        1  0.00%  /usr/dt/bin/dtterm
          8  0.00% 99.98%        6  0.00%  /usr/shlib/libmach.so
        ...
          0  0.00% 100.00%     145  0.03%  /dsk0h/dcpidb/PALcode

In all of our examples, we will use ellipses (...) to denote additional, irrelevant output that has been left out.
The -i option directs dcpiprof to produce a system-level summary and an image-by-image breakdown of cycle and instruction retire data.
The system-level summary is shown in the table at the top of the output:
    Column          Total  Period (for events)
    ------          -----  ------
    cycles         248126  126976
    retired:count  517438  126976

During the collection epoch, 248,126 cycle samples and 517,438 retired instruction samples were counted. Each sample accounts for 126,976 events (on average.)
The image-level summary shows the number of cycle and retired instruction samples per executable image. Notice that the breakdown shows every image in the system that was active during the data collection epoch. Our program, ecm4c.test, heads the list as it is extremely compute-bound. The entry /vmunix is the Tru64 operating system kernel. The images ending with .so are shared object libraries. The image /dsk0h/dcpidb/PALcode is the Privileged Architecture Library, which is the bridge between user code and the operating system. Finally, the program /usr/bin/dcpid is the DCPI data collection daemon. Notice that collection overhead is relatively low, a good property for a profiling system!
The image-level summary helps identify candidates for further analysis. In this case, our program image, ecm4c.test, is the best candidate to examine further. A deeper look takes us to the procedure level. The dcpiprof tool also produces a procedure-by-procedure breakdown of a particular executable image. We used the command:

    dcpiprof -pm retired -event cycles ecm4c.test

to produce the following breakdown for ecm4c.test:
    21 > dcpiprof -event cycles -pm retired ecm4c.test
    Column          Total  Period (for events)
    ------          -----  ------
    cycles         193926  126976
    retired:count  427465  126976

    The numbers given below are the number of samples for each listed
    event type or, for the ratio of two event types, the ratio of the
    number of samples for the two event types.
    ===========================================================
                           retired
     cycles      %   cum%   :count      %  procedure            image
      73297 37.80% 37.80%   200731 46.96%  __gmpn_addmul_1      ecm4c.test
      15166  7.82% 45.62%    27936  6.54%  mpz_mod_n            ecm4c.test
      11816  6.09% 51.71%    12357  2.89%  __gmpn_sb_divrem_mn  ecm4c.test
       9365  4.83% 56.54%    23107  5.41%  __gmpn_sub_n         ecm4c.test
       8996  4.64% 61.18%    22006  5.15%  __gmpn_add_n         ecm4c.test
       8450  4.36% 65.54%    14815  3.47%  __gmpn_mul_basecase  ecm4c.test
       7563  3.90% 69.44%    16466  3.85%  __gmpz_mul           ecm4c.test
       7270  3.75% 73.18%    13795  3.23%  __gmpz_sub           ecm4c.test
       6322  3.26% 76.44%    12117  2.83%  __gmpn_mul_1         ecm4c.test
       6282  3.24% 79.68%    12756  2.98%  __gmpz_add           ecm4c.test
       4770  2.46% 82.14%     9657  2.26%  __gmpn_submul_1      ecm4c.test
       4594  2.37% 84.51%     9315  2.18%  __gmpn_sqr_basecase  ecm4c.test
        ...

The summary table on top shows that 193,926 cycle samples and 427,465 retired instruction samples were attributed to the procedures in ecm4c.test.
The hottest procedure, __gmpn_addmul_1, had 73,297 cycle samples and 200,731 retired instruction samples, accounting for 37.80% and 46.96% of the cycle samples and retired instruction samples in ecm4c.test. Therefore, the procedure __gmpn_addmul_1 may be a good candidate for even deeper analysis.
We used the command:

    dcpilist -pm retired -event cycles __gmpn_addmul_1 ecm4c.test > addmul

to produce an instruction-level analysis. Since __gmpn_addmul_1 is rather lengthy, we redirected the dcpilist output into a file. When we looked at the file with a text editor, we noticed that __gmpn_addmul_1 had three separate regions of code that were hot. Here is the dcpilist output for the first hot region of __gmpn_addmul_1:
    __gmpn_addmul_1 (tmp-addmul_1.s):
     retired
      :count  cycles  freq
         ...
        1558    1748  1586  0x120010fa8 :  lda   sp, -240(sp)
        1604       2  1586  0x120010fac    stq   s0, 8(sp)
        1616    1726  1586  0x120010fb0    stq   s1, 16(sp)
        1602       2  1586  0x120010fb4    stq   s2, 24(sp)
        1600       4  1586  0x120010fb8    stq   s3, 32(sp)
        1597       0  1586  0x120010fbc    stq   s4, 40(sp)
        1562    1698  1586  0x120010fc0    stq   s5, 48(sp)
        1583       0  1586  0x120010fc4    stq   s6, 56(sp)
        1587       0  1586  0x120010fc8    and   a2, 0x7, a4
        1553       0  1586  0x120010fcc    srl   a2, 0x3, a2
        1669    1761  1586  0x120010fd0    bis   zero, zero, v0
        1652       0  1586  0x120010fd4    beq   a4, 0x12001109c
         ...

The instruction-level summary shows the distribution of samples for retire and cycle events across the individual instructions of __gmpn_addmul_1. DCPI analyzes the control flow through procedures and computes the relative frequency of the control paths. The freq column is the result of DCPI's analysis and it estimates the number of times an instruction was executed (excluding speculative attempts to execute the instruction.)
The instruction-level summary is presented as an annotated disassembly
of the procedure under study. If source code is available, the
source and disassembly can be interspersed. See the man page for
dcpilist
for further information on this capability.
Without source, an engineer must be prepared to dig into and
understand the annotated assembly code.
Sections 1.2 and 1.3 demonstrated the basics of DCPI data collection
and analysis. Section 1.3 showed how to use dcpiprof
and dcpilist
to find candidates for analysis and how
to drill down to the instruction level.
The rest of the paper is organized in the following way. Section 2
describes the most common performance bottlenecks that may occur
when executing a program on Alpha 21264. Section 3 introduces a set
of performance measures that can help identify the bottlenecks
described in Section 2. Next, Section 4 applies the performance
measures to the example program ecm4c.test
.
Finally, Section 5 describes the performance measures in more detail.
The Alpha 21264 is a superscalar, pipelined processor. It achieves high throughput by identifying and exploiting instruction level parallelism (ILP.) ILP is exploited through pipelining (successive stages of parallel instruction execution), multiple functional units (four integer and two floating point), and concurrent computation while memory operations complete (through pending load and store queues and other memory system components.) Best performance is achieved when instructions and data flow through the processor without stalling the pipeline.
The 21264 is opportunistic. It issues instructions for execution as soon as they are data-ready, even if the instructions execute in a different order than they appear in the program. This is called "out-of-order" execution. Of course, the processor cannot and does not arbitrarily shuffle instructions. The Alpha 21264 enforces rules that guarantee program correctness. Out-of-order execution maximizes the use of functional units and the memory system.
The memory system maintains a queue of load instructions and a queue of store instructions. These queues let the pipeline continue program execution while memory operations proceed to completion.
There are three main kinds of performance problems that may arise during program execution: control flow (conditional branch and jump) mispredictions, memory system replay traps, and memory system delays due to poor locality.
The Alpha 21264 fetches instructions aggressively to keep its pipeline busy. The processor fetches ahead of the current architectural state of the program in order to produce a sufficiently steady stream of instructions into the pipeline. In straight-line code, the processor just needs to fetch ahead of the program counter as it increments along through memory. But, all programs have conditional branches and jumps that change the flow of control. To keep the instructions coming, the Alpha 21264 speculates about the likely outcome of conditional branches and jumps, and predicts the new program counter value and continues to fetch from there.
Fetched instructions flow into the pipeline stages for execution. In fact, an instruction on a speculated program path may begin execution before the actual branch condition is known! Once the condition is known, and if the prediction was correct, the instruction is known to be on the good path and execution continues until the instruction completes and retires. (When an instruction retires, the processor has committed the result of the instruction to the architectural state of the machine; program execution has "officially" advanced.) If the prediction was incorrect, the instruction is on the bad path, and any intermediate computation must be discarded. The processor must "back up" to the conditional branch (or jump) and begin fetching down the good path. The pipeline flush and restart operation is costly as well as the wasteful execution of bad path instructions.
Clearly, the instruction stream, and performance, depend upon the ability of the processor to correctly predict conditional branches and jumps. If the vast majority of conditional branches and jumps are predicted correctly, the flow of instructions into the pipeline will be fast and regular.
There are a few things that a programmer can do to help with branch and jump prediction. For example, a simple conditional assignment such as A = (cond ? B : C) does not need a branch at all. The compiler can detect these special cases and generate conditional move instructions instead of a conditional branch.
As mentioned earlier, the Alpha 21264 uses out-of-order execution to exploit ILP. The processor maintains program correctness by preserving the producer-consumer relationships between instructions. The 21264 uses a technique called "register renaming" to analyze and preserve these relationships when instructions pass values between each other in registers. The register numbers (names) are right there in the instructions and the processor can perform on-the-fly dependency analysis on the instructions.
Data dependency between load and store instructions cannot be detected through on-the-fly analysis, however. Data dependency cannot be detected until effective addresses have been computed, compared, and found to refer to the same memory location. Thus, memory instructions must be issued and begin to execute deep in the pipeline before data dependency can be detected.
Here is an example. In the program excerpt below, the load instruction uses the result produced by the store instruction:

    ...
    stq   t0, 10(a0)
    ...
    ldq   v0, 10(a0)
    ...

The value is passed through a common memory location. Due to out-of-order instruction issue, the consumer (LDQ) may issue and begin to execute before the producer (STQ.) When the store instruction issues and begins execution, the memory system detects that the load and store refer to the same memory location (i.e., the effective addresses are the same.) It also detects that an incorrect result will be produced if the load is allowed to complete, since the correct value has not yet been written to memory by the store instruction. To correct this situation, the pipeline traps and replays the load instruction from the beginning. This trap and recovery mechanism is called a "replay trap," and replay traps are expensive.
There are six main kinds of Alpha 21264 memory system replay traps:
Most replay traps can be avoided through careful programming practice. Here are some suggestions.
Data layout in memory is a critical, performance-oriented programming consideration and will become even more important with non-uniform memory access (NUMA) machines. Physical memory is organized into a hierarchy of levels where each level is an optimal trade-off between memory size and speed. Each level can be characterized by its size and latency, which is the number of processor cycles needed to read a value from that level into a CPU register and use it (load-use.) Here are capacity and load-use latency values for the XP-1000 workstation:
Level | Capacity | Latency |
Level 1 (L1) cache | 64Kbytes | 3 cycles |
Level 2 (L2) cache | 4Mbytes | 26 cycles |
Primary memory | 256Mbytes+ | 130 cycles |
Backing store | Gbytes and up | Millions of cycles |
The L1 caches communicate directly with the central processor and are the fastest in order to keep up with the CPU. Backing store, which holds non-resident virtual memory pages, is the biggest and slowest. NUMA machines subdivide primary memory into local and remote memory -- each with a different capacity and latency. Remote memory may be 1,000 to 2,000 processor cycles away. A memory request moves through the levels of the hierarchy (L1 -> L2 -> Primary memory -> Backing store) until the data item is found. The data item and some of its nearest neighbors are brought forward in the hierarchy under the assumption that the item and its neighbors will be used in the very near future.
Effective use of the memory hierarchy depends upon locality of reference. Locality is both spatial and temporal. Spatial locality refers to memory items that are near each other in memory. Temporal locality refers to memory items that are recently used. A program with good locality tends to use and reuse items within a small neighborhood of memory. Programs with poor locality jump between distant memory addresses and never reuse recently referenced data items.
Memory levels may also be characterized by the kinds of addressing needed to access a data item. On the XP-1000 workstation, the L1 caches use virtual addresses and the L2 cache and primary memory use physical addresses. Primary memory is subdivided into physical memory pages and page tables define the mapping from virtual to physical addresses. The page tables also describe where non-memory resident pages are in the backing store. The processor handles virtual to physical address translation in hardware because this mapping must be performed fast. The processor uses special memory tables called "translation look-aside buffers" (TLBs) to hold the most recently used virtual to physical address translation information. The TLBs have a finite capacity and must be updated whenever a page is used and that page cannot be found in the TLB. The update is not as costly as a page fault (i.e., the page is on backing store), but there is a non-trivial penalty nonetheless.
A symptom of poor locality is a long load-use latency. If the processor cannot find data in one of the closer, faster memories, then the memory request must go out to farther, slower memory units.
Here are some suggestions for achieving good locality and memory system performance.
The Alpha 21264 has a number of resources that are allocated dynamically. Each resource is allocated from a finite pool, and if the pool is empty, then the processor must react, stall, or recover in some way. Some of these resources require an understanding of arcane machine features (e.g., miss address file, victim buffer, I/O write buffer, etc.) and we will not discuss them here.
Two important resources that we can all relate to are the physical register pool and the functional units. Up to 80 instructions may be in flight at any one time in the Alpha 21264 pipeline. The pipeline uses two (integer and floating point) large pools of physical registers and register renaming to keep many instructions in flight. The processor tracks instructions and their register use with a tag-like entity known as an "Inum." Although it occurs infrequently, computationally dense code may run out of physical registers and Inums. This condition results in a "map stall" that persists until instructions in the "back end" of the pipeline retire and release physical registers and Inums.
The integer functional units have four 64-bit adders, four logic units, two barrel shifters with byte logic, two sets of conditional branch logic, one pipelined multiplier, and one pipelined unit to perform count extension (CIX) and multimedia extension (MVI) instructions. The floating point execution unit has one pipelined multiplier, one pipelined adder, one non-pipelined divide unit (associated with the FP adder pipeline) and one non-pipelined square root unit (associated with the FP adder pipeline.) These limitations affect the number of instructions of a particular type that can be executed concurrently in a single processor cycle.
We have not emphasized performance issues related to these resources since there is not really much that a C or FORTRAN programmer can do to affect register mapping and functional unit scheduling. The compiler makes decisions about register allocation and instruction scheduling and the programmer can affect these decisions indirectly, at best, through optimization switches and rearranging code at the source level. The Alpha 21264 then makes its own dynamic decisions through register renaming and instruction issue rules.
This section is a brief introduction to performance measures that can be applied to a program to assess its performance. Section 5 describes these measures in more detail including some important conditions to consider when applying them.
The table below is a list summarizing the measures that we will discuss. Some of these measures can be used to assess the program as a whole and generally indicate the presence of performance problems. Other measures deal with specific performance areas: control flow misprediction, memory system replay traps, and memory system delays.
Measure | Indication | Goal |
cycles | Cycles (time) spent | Decrease |
retired | Retired instructions | Decrease |
retired/cycles | Instructions retired per cycle | Increase |
retired/aborted | Useful work vs. wasted effort | Increase |
retired/trap | Useful work per trap | Increase |
retired/replay trap | Useful work per replay trap | Increase |
retired/mispredict | Useful work per mispredict | Increase |
cond br mispredict/retired | Conditional branch misprediction rate | Decrease |
jsr mispredict/retired | JSR group misprediction rate | Decrease |
ldstorder | Load-store order replay trap | Decrease |
replays | Memory system replay trap | Decrease |
retired/DTB miss | Useful work per DTB miss | Increase |
retired/ITB miss | Useful work per ITB miss | Increase |
nyp/retired | Lower bound I-cache miss rate | Decrease |
retire delay/valid instructions | Average retire delay | Decrease |
Seven measures can be used to assess the program as a whole and to perform a little high-level "triage." (See the next table below.) In our earlier example, we used processor cycle samples (cycles) and the number of retired instructions (retired) to identify hot spots in the application. In addition to this role, cycles and retired instructions provide a basis for comparison through ratios. Although one can compare the raw number of replay traps from different experimental runs, ratios offer a few advantages over raw, absolute event counts.
Measure | Indication | Goal |
cycles | Cycles (time) spent | Decrease |
retired | Number of instructions retired | Decrease |
retired/cycle | Instructions retired per cycle | Increase |
retired/aborted | Useful work vs. wasted | Increase |
retired/trap | Useful work per trap | Increase |
retired/replay trap | Useful work per replay trap | Increase |
retired/mispredict | Useful work per mispredict | Increase |
Retired instructions per cycle (and its reciprocal) is a rough measure of concurrency, i.e., how well the machine is exploiting ILP. If the processor is retiring, on average, more than one instruction per cycle, then concurrency is being realized. Architectural research shows that 3.3 retires per cycle can be achieved in integer code with loads that hit in the D-cache, but this is unlikely to be sustained. A retire/cycle ratio that approaches 3 is very good for integer code. Similar analysis for floating point shows a maximum of 2.3 retires per cycle for short periods. A retire/cycle ratio approaching 2 is very good for floating point code.
The four ratios:

    retired / aborted,
    retired / trap,
    retired / replay trap, and
    retired / mispredict

provide a quick way to assess the effect of aborted (discarded or killed) instructions, traps of all kinds, memory system replay traps specifically, and control flow mispredictions (conditional branches and jumps.) The ratio of retired instructions to aborted instructions is a rough measure of efficiency, that is, how much useful work is done relative to work that is discarded. A better measure of efficiency would be the number of cycles lost due to aborted instructions (via traps.) Unfortunately, due to limitations of the performance measurement hardware in the Alpha 21264A, we cannot quantify this notion of efficiency. It's hard to specify a particular retire/abort threshold. Two retires per abort is clearly bad. The threshold depends upon one's tolerance of inefficiency. It is probably impossible to remove all traps, especially in code with memory traffic.
The ratio of retired instructions per trap and the ratio of retired instructions per memory system replay trap indicate the frequency of all traps and memory system replay traps, respectively. Fifty retires per trap indicates poor performance and the cause of the traps should be investigated.
The ratio of retired instructions per mispredict indicates the frequency of conditional branch and JSR group (jump) mispredicts. As the analysis in Section 5 shows, a high prediction rate is needed to produce a steady stream of good path instructions. As a rough guideline, a programmer should strive for a minimum of twenty retired instructions per misprediction. Of course, even higher ratios are desirable. Good prediction will bring down the number of aborted instructions since more fetched and issued instructions will be on the good path and they will successfully complete and retire.
The four TLB trap types can skew the number of retired instructions, sometimes making system-level statistics look better than they actually are. This bias should be removed from system-level measures. The four TLB trap types are:

- single DTB miss (event: dtbmiss, PALcode entry: dtbm_single)
- double DTB miss, 3-level page table (event: dtb2miss3, PALcode entry: dtbm_double_3)
- double DTB miss, 4-level page table (event: dtb2miss4, PALcode entry: dtbm_double_4)
- ITB miss (event: itbmiss, PALcode entry: itb_miss)
These traps are handled by PALcode routines, so their retired instructions appear under the PALcode image in the dcpiprof image-by-image summary:

                           retired
     cycles      %   cum%   :count      %  image
     193926 78.16% 78.16%   427465 82.61%  ecm4c.test
      27351 11.02% 89.18%    24005  4.64%  /usr/shlib/libc.so
      26322 10.61% 99.79%    65532 12.66%  /vmunix
        ...
          0  0.00% 100.00%     145  0.03%  /dsk0h/dcpidb/PALcode
        ...

In our example, only 145 retired instruction samples were collected for the PALcode image. (The PALcode image is recreated in the DCPI database for convenient access.) When a substantial number of DTB miss or other TLB miss events are recorded, the number of PALcode retires may be quite high. In this case, use the dcpiprof command:

    dcpiprof -pm retired /dsk0h/dcpidb/PALcode

to produce a procedure-by-procedure breakdown of retired instruction samples attributed to the PALcode image. Here is a short fragment of the breakdown taken from the example:

     retired
      :count      %   cum%  procedure            image
          25 17.24% 17.24%  swpipl_cont          /dsk0h/dcpidb/PALcode
          21 14.48% 31.72%  dtbm_single          /dsk0h/dcpidb/PALcode
          17 11.72% 43.45%  subr_6c08            /dsk0h/dcpidb/PALcode
          15 10.34% 53.79%  swpipl               /dsk0h/dcpidb/PALcode
          11  7.59% 61.38%  interrupt            /dsk0h/dcpidb/PALcode
          10  6.90% 68.28%  interrupt_subr_6400  /dsk0h/dcpidb/PALcode
           8  5.52% 73.79%  rti_cont             /dsk0h/dcpidb/PALcode
           5  3.45% 77.24%  dtbm_double_3_cont   /dsk0h/dcpidb/PALcode
           5  3.45% 80.69%  callsys_cont         /dsk0h/dcpidb/PALcode
           4  2.76% 83.45%  swpctx_cont          /dsk0h/dcpidb/PALcode
         ...

The DTB miss (single) routine had only 21 retired instruction samples. Thus, in our example, we can ignore the effect of DTB misses on retires. If the number of retired instruction samples is high, more than one or two percent, then the retired instruction samples for DTB miss should be subtracted from the number of system-level retired instruction samples. This will prevent DTB misses from skewing the system-level estimate of retired instructions.
One other component can skew system-level measures -- the operating system idle loop. The idle loop burns cycles, has good ILP and rarely traps. Thus, it can make system-level statistics look too good. Use the command:
    dcpiprof -pm retired /vmunix | more

to obtain a procedure-by-procedure breakdown of the Tru64 operating system kernel. Here is an excerpt from the example:

     retired
      :count      %   cum%  procedure        image
       64651 98.66% 98.66%  idle_thread      /vmunix
         104  0.16% 98.81%  hardclock        /vmunix
          47  0.07% 98.89%  simple_lock      /vmunix
          41  0.06% 98.95%  ufs_sync_int     /vmunix
          40  0.06% 99.01%  table            /vmunix
          37  0.06% 99.07%  malloc_internal  /vmunix
         ...

The procedure named idle_thread is the idle loop. In this case, 64,651 retired instruction samples have been attributed to the idle loop out of 65,532 total for the image /vmunix. The system-level measure of retired instructions should be reduced by the number of retired instruction samples attributed to the idle loop.
The measures in this section can be applied at the image, procedure and instruction levels as well as the overall system level. Don't hesitate to apply these measures at finer levels of granularity, too.
Although we have referred to "cycles," "retired instructions," etc. in the last several paragraphs, the terms "cycle samples" and "retired instruction samples" are more precise and correct. Throughout this paper, the word "samples" is usually implied, if not explicitly stated. That is, after all, what we are combining in the numeric formulas.
As mentioned in Section 2, the Alpha 21264 predicts the outcome of conditional branches and jumps in order to fetch instructions beyond them. More precisely, "jumps" are the JSR group of Alpha instructions. The JSR group includes: JMP, JSR, RET, and JSR_COROUTINE.
The Alpha 21264 uses two different mechanisms to predict the target of conditional branches and JSR group instructions.
Measure | Indication | Goal |
cond br mispredict/retired | Conditional branch misprediction rate | Decrease |
jsr mispredict/retired | JSR group misprediction rate | Decrease |
The ratio of conditional branch mispredictions to retired instructions is the misprediction rate for a conditional branch. The ratio of JSR group mispredictions to retired instructions is the misprediction rate for a JSR group instruction.
The best way to apply these measures is to find individual conditional
branch, return (RET), or subroutine call (JSR) instructions
with a high number of misprediction samples. Follow the procedure
illustrated in the example in Section 1.3 and use dcpiprof
to drill down to frequently mispredicted instructions. Ideally,
individual misprediction rates should be as low as possible (5% or below.)
The ratio of retired instructions to replay traps is usually enough to identify performance loss due to memory system replay traps. This section suggests some DCPI/ProfileMe event types that can be used to analyze traps. These events are most meaningful at the instruction level, rather than as aggregate measures at the system, image or procedure levels.
Measure | Indication | Goal |
ldstorder | Load-store order replay trap | Decrease |
replays | Memory system replay trap | Decrease |
The ldstorder
event indicates that an instruction caused
a load-store order replay trap. This attribution is precise. If the
ldstorder
event is not asserted and the instruction
has replay trap events (replays
) attributed to it, then the
replay trap was one of the following:
The ratio of retired instructions to ITB misses and the ratio of retired instructions to DTB misses indicate the frequency of specific TLB misses. DTB misses are more likely to be a problem than ITB misses, especially in programs using large arrays or a big random-access heap.
It is difficult to set a problem threshold for TLB misses; however, 50-100 retired instructions per DTB miss, for example, would be unacceptably high. Let's take the simple case of a program that walks sequentially through a very large array of long integers (8 bytes per integer.) Each DTB entry supports access to an 8,192 byte page, which is equivalent to 1,024 long integers. Assume that the loop is very tight and consists of four instructions:
(4 instructions · 1024 iterations) / 1 DTB miss = 4,096 retired instructions per DTB miss

This kind of analysis can help you to find your comfort level for a particular program.
Measure | Indication | Goal |
retired/DTB miss | Useful work per DTB miss | Increase |
retired/ITB miss | Useful work per ITB miss | Increase |
nyp/retired | Lower bound I-cache miss rate | Decrease |
retire delay/valid instructions | Average retire delay | Decrease |
The ProfileMe event nyp
is asserted when a profiled
instruction is contained in an aligned 4-instruction I-cache fetch
block that requested a new I-cache fill. The event name is an
acronym for "not yet prefetched," which is a pretty apt description.
The I-cache fill is in response
to an I-cache miss. Unfortunately, the performance
counting hardware is unable to count all I-cache misses. For example,
the nyp
bit is not set if the prefetcher asked for the instruction
a mere cycle before the fetcher needed it. Thus, the number of
nyp
samples is only a lower bound on the total number
of I-cache misses.
With that limitation in mind, the ratio of nyp
samples
to retired instruction samples is an optimistic approximation of
the I-cache miss rate. The true rate is somewhat higher. Unfortunately,
there is no way to estimate the error between the approximation
and the true rate.
Retire delay is a lower bound on the number of processor cycles that
an instruction has delayed its own retirement and the retirement of
the other instructions that follow it in program order. As a measure,
it is most effective at the instruction level. A large retire delay
may indicate the presence of a bottleneck. The ProfileMe retire
delay (retdelay
) is defined only for instructions that
retire without causing a trap (i.e., instructions that are
valid
.)
The Alpha 21264A does not include all cycles in its retire delay count. Thus, the total cycle count is almost always larger than the retire delay count. (Ideally, they would be the same.) Cycles are not counted while:
The difference, cycles - retire delay, is an estimate of how many cycles are lost to I-cache misses, pipeline flushes and pipeline restarts.
Average retire delay is the total instruction retire delay divided by the number of valid retired instructions (i.e., the number of instructions that retired without trapping.) Average retire delay can be used to identify and analyze long-latency load operations in the following way:
Here's a simple demonstration of the method in action. Assuming that
the procedure function_mn
is an important, hot procedure,
we used dcpilist
to display the retire delay and
retired instruction samples for its instructions. A quick scan of
the annotated disassembly showed a subtract instruction
(subq
) with a large retire delay:
1 > dcpilist -pm valid:retdelay+retired function_mn example_prof | more
function_mn (example.c):
...
     valid    valid
 :retdelay   :count   freq
         2      112    105   0x12001d180   ldq   t9, 144(sp)
     18677      100    105   0x12001d184   subq  t9, t8, t5
...

We then reran
dcpilist
, this time displaying the
average retire delay for each instruction:
2 > dcpilist -pm valid:retdelay::valid function_mn example_prog | more
function_mn (example.c):
     valid
 :retdelay
        ::
     valid
    :count
...
    0.0179   0x12001d180   ldq   t9, 144(sp)
  186.7700   0x12001d184   subq  t9, t8, t5
...

The average retire delay of the subtract instruction easily exceeds our working threshold of 20 cycles. The subtract consumes a value produced by a load instruction (
ldq
) and the value is
passed in register t9
. The load, therefore, is probably
the object of L1 D-cache misses.
In this case, it wasn't hard to find the producing load instruction because the load (the producer) was immediately before the subtract (the consumer.) Detective work may be a little more difficult in real code, although the producer will still pass the value to the consumer in a common register. Since the Alpha 21264A may have up to 80 instructions in flight at once, you may need to look further back in the code, including instructions in a previous iteration of a loop or back through a branch into a different basic block.
Why does the retire delay accumulate on the consuming instruction instead of the load? When the processor begins execution of the load, it evaluates the effective address and puts the load instruction into the load queue where it awaits action by the memory subsystem. If the load does not trap, it is allowed to retire. The processor continues to issue and execute instructions, eventually coming to the consumer. The processor must delay execution of the consuming instruction until the value from the load has been read by the memory subsystem and is available. If the memory system takes a long time to produce the value (i.e., if the value is in the L2 cache or worse, primary memory), then the consuming instruction will be delayed a long time and its retire delay will be high.
Out-of-order execution may hide part of the D-cache miss penalty. Useful work can and sometimes is performed between the load instruction and its consumer. Sometimes one of the intervening instructions suffers a long stall due to an I-cache miss, etc. This time will not be reflected in the retire delay of the consuming instruction.
In this section, we'll show how to apply the basic performance measures.
We will use an example program, ecm4c
, that performs
integer factorization using the Elliptic Curve Method. The program
uses the GNU Multiple Precision (MP) Arithmetic Library to perform
arithmetic on arbitrarily large integers.
The default make
procedure for the GNU MP library uses
these C compiler flags on an Alpha 21264A-based platform:
-g -arch ev6 -tune ev6

The -arch and -tune options tell the compiler to produce and tune code for a 21264-based target platform. The -arch switch actually implies -tune and is sufficient by itself.
Use of -g
is more problematic from the performance tuning
point of view. The -g
option controls the level of symbolic
debug information to be generated into the resulting object file/image.
The -g
switch is equivalent to the -g2
option,
which produces full symbolic information for debugging. However,
-g
suppresses optimization (i.e., -g
or -g2
implies the optimization switch -O0
.)
This default build switch will not provide the best performance.
We recommend changing the debug flag to -g3
, which
produces symbol table information and optimized code. Due to
compiler optimizations, some correlation is lost between the source
code and the executable program. We chose to change the individual
Makefiles, but the correct place to make this change is in
configure.in
, since the Makefiles are generated from
configure.in
by autoconf and automake.
To emphasize the point, when ecm4c
is built with the
-g
switch, it executes in about 46.6 seconds. When
it is built with -g3
(optimizations enabled),
ecm4c
executes in 41.8 seconds, a savings of nearly
five seconds.
Here are a few tips on the specification of DCPI/ProfileMe
events. The tools dcpiprof
and dcpilist
accept command line options that specify cycle and ProfileMe events.
Processor cycles are counted using an aggregate counter and cycle
samples are selected with the option:

-event cycles

ProfileMe events are selected with the option:

-pm XXXXX

where "XXXXX" may be any ProfileMe event
or combination of ProfileMe events and ratios. Individual ProfileMe event
types are retired
, valid
, nyp
,
cbrmispredict
, ldstorder
, and so forth.
See man dcpiprofileme
for a complete list. Samples for
instructions that did not take a specific ProfileMe event
may be selected with an option of the type:

-pm /YYYYY

where the "/" indicates "NOT" and "YYYYY" may be any ProfileMe event. For example, the option:

-pm /retired

selects samples for instructions that did not retire.
The ProfileMe event syntax also permits three kinds of combinations of event specifiers:

The "+" combination, which selects samples for several events at once, displaying one column per event.
The "^" combination, which selects samples for instructions on which both events occurred.
The "::" combination, which displays the ratio of the sample counts for two events.

For example, the option:

-pm retired+/cbrmispredict^mispredict::retired

displays the number of retired instruction samples in one column and, in a second column, the ratio of instructions that suffered mispredicts other than conditional branch mispredicts to the number of retired instructions.
The table below is another list of the performance measures. The third column contains the corresponding simple DCPI command line option for each given measure.
Measure | Indication | Option |
cycles | Cycles (time) spent | -event cycles |
retired | Retired instructions | -pm retired |
retired/cycles | Cycles needed to retire instruction | -pm retired -event cycles |
retired/aborted | Useful work vs. wasted effort | -pm retired::/retired |
retired/trap | Useful work per trap | -pm retired::trap |
retired/replay trap | Useful work per replay trap | -pm retired::replays |
retired/mispredict | Useful work per mispredict | -pm retired::mispredict |
cond br mispredict/retired | Conditional branch misprediction rate | -pm cbrmispredict::retired |
jsr mispredict/retired | JSR group misprediction rate | -pm /cbrmispredict^mispredict::retired |
ldstorder | Load-store order replay trap | -pm ldstorder |
replays | Memory system replay trap | -pm replays |
retired/DTB miss | Useful work per DTB miss | -pm retired::dtbmiss |
retired/ITB miss | Useful work per ITB miss | -pm retired::itbmiss |
nyp/retired | Lower bound I-cache miss rate | -pm nyp::retired |
retire delay/valid instructions | Average retire delay | -pm valid:retdelay::valid |
Command lines and output lines can get pretty long. Here are a few helpful suggestions.
The event name retired can be abbreviated as "ret", and cbrmispredict can be abbreviated as "cbr".
Put long dcpiprof and dcpilist commands into shell scripts and invoke the scripts instead.
The first step in the system-level assessment is to obtain a
summary of cycle and retired instruction samples. The summary may be
obtained with the dcpiprof
command:
dcpiprof -i -pm retired -event cycles > ret+cyc

Here, we redirected the summary to a file ("ret+cyc") so we can refer to the data again later, if necessary. This command produces the following output:

Column          Total     Period (for events)
------          -----     ------
cycles         248126     126976
retired:count  517438     126976

The numbers given below are the number of samples for each
listed event type or, for the ratio of two event types, the
ratio of the number of samples for the two event types.
===========================================================
                          retired
 cycles      %    cum%    :count       %   image
 193926  78.16%  78.16%   427465  82.61%   ecm4c.test
  27351  11.02%  89.18%    24005   4.64%   /usr/shlib/libc.so
  26322  10.61%  99.79%    65532  12.66%   /vmunix
    151   0.06%  99.85%       44   0.01%   unknown@positron
    126   0.05%  99.90%      108   0.02%   /usr/bin/dcpid
     67   0.03%  99.93%       19   0.00%   /usr/shlib/libpthread.so
     55   0.02%  99.95%       59   0.01%   /sbin/loader
     30   0.01%  99.96%       18   0.00%   /usr/shlib/libXt.so
     15   0.01%  99.97%       10   0.00%   /usr/shlib/libX11.so
     11   0.00%  99.97%        2   0.00%   /usr/shlib/X11/libdix.so
     11   0.00%  99.98%        1   0.00%   /usr/shlib/X11/libos.so
...

This output shows a system-wide summary at the top and an image-by-image breakdown of cycle samples and retired instruction samples. In the interest of space, we truncated the table, leaving off images that contributed the least to cycles and retires.
Overall cycle samples and retired instruction samples may have been
skewed by the kernel idle loop and by PALcode TLB miss handling.
In Section 3.1, we saw that TLB miss handling was inconsequential.
The contribution of the idle loop, however, is substantial. To
check idle loop processing, we used the following dcpiprof
command to show a procedure-by-procedure breakdown of kernel processing:
dcpiprof -event cycles -pm retired /vmunix > triage.vmunix

                         retired
 cycles      %    cum%   :count       %   procedure        image
  23838  90.56%  90.56%    64651  98.66%   idle_thread      /vmunix
    254   0.96%  91.53%       41   0.06%   ufs_sync_int     /vmunix
    223   0.85%  92.38%        9   0.01%   _XentInt         /vmunix
    153   0.58%  92.96%       47   0.07%   simple_lock      /vmunix
    142   0.54%  93.50%      104   0.16%   hardclock        /vmunix
     54   0.21%  93.70%       37   0.06%   malloc_internal  /vmunix
     54   0.21%  93.91%       12   0.02%   pmap_zero_page   /vmunix
     51   0.19%  94.10%       40   0.06%   table            /vmunix
...

Idle thread processing accounts for 90.56% of the cycle samples attributed to the operating system kernel, "/vmunix". We need to subtract idle thread cycle samples and retired instruction samples from the system-wide totals to remove the idle loop bias:
cycles         248,126 - 23,838 = 224,288 unbiased cycle samples
retired:count  517,438 - 64,651 = 452,787 unbiased retire samples

We will use the unbiased values to compute the other system-level measures. The overall ratio of retired instructions per cycle is 452,787 / 224,288, or roughly 2.02 retired instructions per cycle. This is not too bad for the system as a whole (with a target of 3 retires/cycle), but perhaps there is some room for improvement.
The next set of measures to compute are the four ratios that indicate
potential areas for investigation (Section 3.1.) The following
dcpiprof
command gives us the raw data that we need:
dcpiprof -i -pm retired+/retired+trap+replays+mispredict > triage.1

This command displays the following system-level summary table at the top of its output:

Column             Total     Period (for events)
------             -----     ------
retired:count     517438     126976
!retired:count     30651     126976
!notrap:count       1724     126976
replays:count         98     126976
mispredict:count    1624     126976

The next step is to compute the four system-level ratios, using the unbiased count of retired instruction samples:

retired / aborted      452,787 / 30,651 =    14.77
retired / trap         452,787 /  1,724 =   262.64
retired / replays      452,787 /     98 = 4,620.27
retired / mispredict   452,787 /  1,624 =   278.81

It's interesting to compare these unbiased ratios against the idle loop-biased ratios:

dcpiprof -i -pm ret+ret::/ret+ret::trap+ret::replays+ret::mispredict > triage.2

Column                                 Total     Period (for events)
------                                 -----     ------
retired:count                         517438     126976
retired:count::!retired:count      16.881603     NA
retired:count::!notrap:count      300.138051     NA
retired:count::replays:count     5279.979592     NA
retired:count::mispredict:count   318.619458     NA

Notice that the biased ratios paint a rosier picture of performance than the unbiased ratios do.
At this point in the analysis, no major problem is indicated. It may be that statistics for too many different sub-components have been aggregated and problems are hidden. In the next section, we will push the analysis down a level and will take a look at image-level measures.
At the image level, we can get statistics about each of the individual
active images in the system. The dcpiprof
command:
dcpiprof -i -pm ret+ret::/ret+ret::trap+ret::replays+ret::mispredict > triage.3

produces a table showing the key overall assessment ratios for each of the active images. Here is an excerpt from that table:

retired   retired:count  retired:count  retired:count  retired:count
 :count   ::!retired     ::!notrap      ::replays      ::mispredict
          :count         :count         :count         :count         image
 427465    16.2004        320.1985      4857.5568       342.7947      ecm4c.test
  65532   152.7552       2184.4000      9361.7143      2978.7273      /vmunix
  24005     6.6978         69.7820             NA        69.7820      /usr/shlib/libc.so
    145     1.4948         20.7143             NA        20.7143      /dsk0h/dcpidb/PALcode
    108     1.6615         27.0000        36.0000       108.0000      /usr/bin/dcpid
...

The excerpt contains the most active images. The five images, from top to bottom, are the factorization program (
ecm4c.test
),
the operating system kernel (/vmunix
), the C runtime
library (libc.so
), the PALcode library (PALcode
),
and the DCPI data collection daemon (dcpid
). We will
concentrate on ecm4c.test
because it has the most cycles
and retired instructions. However, the retire/abort ratio for
the C library is quite low and would be worth investigating on its own.
The low ratio of retired instructions to control flow mispredicts
suggests that mispredicts are a problem in the active part of the library
code. The retire/cycle ratio for the C library is 24,005/27,423 or
roughly 0.88, suggesting that much improvement could be made. The
retire/cycle ratio for ecm4c.test
is 427,465/193,926 or
roughly 2.20.
In the case of ecm4c.test
, the high retire/replay ratio
rules out replay traps as a likely cause of lost performance.
Control flow mispredictions, again, are worth investigating at the procedure and instruction levels. We should look for long-latency load instructions as well.
In this section we will look for the best candidate procedures for instruction-level analysis.
The dcpiprof
command:
dcpiprof -pm retired -event cycles ecm4c.test > triage.4

produces a table giving the cycle samples and retired instruction samples attributed to each procedure in ecm4c.test. Here are the top procedures in the table:

                         retired
 cycles      %    cum%   :count       %   procedure            image
  73297  37.80%  37.80%   200731  46.96%  __gmpn_addmul_1      ecm4c.test
  15166   7.82%  45.62%    27936   6.54%  mpz_mod_n            ecm4c.test
  11816   6.09%  51.71%    12357   2.89%  __gmpn_sb_divrem_mn  ecm4c.test
   9365   4.83%  56.54%    23107   5.41%  __gmpn_sub_n         ecm4c.test
   8996   4.64%  61.18%    22006   5.15%  __gmpn_add_n         ecm4c.test
   8450   4.36%  65.54%    14815   3.47%  __gmpn_mul_basecase  ecm4c.test
   7563   3.90%  69.44%    16466   3.85%  __gmpz_mul           ecm4c.test
   7270   3.75%  73.18%    13795   3.23%  __gmpz_sub           ecm4c.test
   6322   3.26%  76.44%    12117   2.83%  __gmpn_mul_1         ecm4c.test
   6282   3.24%  79.68%    12756   2.98%  __gmpz_add           ecm4c.test
   4770   2.46%  82.14%     9657   2.26%  __gmpn_submul_1      ecm4c.test
   4594   2.37%  84.51%     9315   2.18%  __gmpn_sqr_basecase  ecm4c.test
...

We edited this table and computed the retire/cycle ratio for each of the procedures to find possible troublemakers. Here is the table:

                         retired                                retired::
 cycles      %    cum%   :count       %   procedure            cycles
  73297  37.80%  37.80%   200731  46.96%  __gmpn_addmul_1      2.74
  15166   7.82%  45.62%    27936   6.54%  mpz_mod_n            1.84
  11816   6.09%  51.71%    12357   2.89%  __gmpn_sb_divrem_mn  1.04  <--
   9365   4.83%  56.54%    23107   5.41%  __gmpn_sub_n         2.46
   8996   4.64%  61.18%    22006   5.15%  __gmpn_add_n         2.45
   8450   4.36%  65.54%    14815   3.47%  __gmpn_mul_basecase  1.75
   7563   3.90%  69.44%    16466   3.85%  __gmpz_mul           2.18
   7270   3.75%  73.18%    13795   3.23%  __gmpz_sub           1.90
   6322   3.26%  76.44%    12117   2.83%  __gmpn_mul_1         1.92
   6282   3.24%  79.68%    12756   2.98%  __gmpz_add           2.03
   4770   2.46%  82.14%     9657   2.26%  __gmpn_submul_1      2.02
   4594   2.37%  84.51%     9315   2.18%  __gmpn_sqr_basecase  2.03

The retire/cycle ratio of the top procedure,
__gmpn_addmul_1
,
is quite good at 2.74. This procedure is written in Alpha assembler code
and was hand-tuned for the Alpha 21264 (EV6) processor.
The procedure, __gmpn_sb_divrem_mn
, has a retire/cycle
ratio of only 1.04. This definitely needs investigation!
Before shifting focus to the implementation of
__gmpn_sb_divrem_mn
, we evaluated the overall assessment
ratios for each of the procedures in ecm4c.test
using the
command:
dcpiprof -pm ret+ret::/ret+ret::trap+ret::replays+ret::mispredict ecm4c.test

Edited output for the most active procedures is:

retired   retired:count  retired:count  retired:count  retired:count
 :count   ::!retired     ::!notrap      ::replays      ::mispredict
          :count         :count         :count         :count         procedure
 200731    79.7818       2995.9851     22303.4444      3460.8793     __gmpn_addmul_1
  27936    15.4856        410.8235             NA       410.8235     mpz_mod_n
  23107    48.2401        700.2121      11553.5000      745.3871     __gmpn_sub_n
  22006    81.8067        431.4902      22006.0000      440.1200     __gmpn_add_n
  16466    10.7832        191.4651       2744.3333      205.8250     __gmpz_mul
  14815    18.8726        493.8333       7407.5000      529.1071     __gmpn_mul_basecase
  13795     6.7293        130.1415        766.3889      156.7614     __gmpz_sub
  12756     5.6021        103.7073        850.4000      118.1111     __gmpz_add
  12357     2.9027         50.8519       1123.3636       53.2629     __gmpn_sb_divrem_mn  <--
  12117    15.5946        390.8710      12117.0000      403.9000     __gmpn_mul_1
   9657     6.8979        482.8500             NA       482.8500     __gmpn_submul_1
   9315    18.9329       4657.5000             NA      4657.5000     __gmpn_sqr_basecase

The procedure
__gmpn_sb_divrem_mn
is retiring only three
instructions for each instruction that it discards. The low
retire to mispredict ratio suggests that we should
investigate mispredicts within the procedure
__gmpn_sb_divrem_mn
. Because this procedure accounts
for just 6% of the cycles in ecm4c.test
, we will need
to substantially improve its performance in order to improve overall
program performance.
Let's take a look at the conditional branch mispredictions in the
procedure __gmpn_sb_divrem_mn
to see if we can improve
the prediction rate. We used the dcpilist
command:
dcpilist -both -event cycles -pm cbr::ret __gmpn_sb_divrem_mn ecm4c.test > sb

to display an annotated listing of the procedure
__gmpn_sb_divrem_mn
. The listing is quite long
so output was redirected to a file (sb
.) This command
produces both a source and assembler listing of the procedure.
The procedure contains a number of conditional branch instructions with poor
misprediction rates.
cbrmispredict
       :count
           ::
      retired
 cycles  :count  freq
...
   6651  0.0234   107  udiv_qrnnd (q, r1, nx, workaround, dx);
...
      0  0.0000   105  0x12001b8bc  mulq    v0, s4, t2
    120  0.0000   105  0x12001b8c0  ldq     s4, 104(sp)
      0  0.0000   105  0x12001b8c4  srl     t6, 0x20, t7
      0  0.0000   105  0x12001b8c8  mulq    v0, s4, t3
      0  0.0000   105  0x12001b8cc  subq    s2, t2, t2
    107  0.0000   105  0x12001b8d0  mulq    t2, t4, t2
      0  0.0000   105  0x12001b8d4  bis     t2, t7, t2
      0  0.0000   105  0x12001b8d8  cmpult  t2, t3, a4
      0  0.4414   105  0x12001b8dc  beq     a4, 0x12001b900
    618  0.0000    56  0x12001b8e0  addq    t2, s3, t2
      0  0.0000    56  0x12001b8e4  lda     t1, -1(v0)
      0  0.0000    56  0x12001b8e8  cmpult  t2, s3, a5
      0  0.0000    56  0x12001b8ec  cmpult  t2, t3, t8
     65  0.0000    56  0x12001b8f0  bic     t8, a5, a5
      0  0.1429    56  0x12001b8f4  blbc    a5, 0x12001b900
    116  0.0000     7  0x12001b8f8  lda     t1, -2(v0)
      0  0.0000     7  0x12001b8fc  addq    t2, s3, t2
   1456  0.0000   105  0x12001b900  ldq     t12, -31368(gp)
      0  0.0000   105  0x12001b904  subq    t2, t3, t2
      0  0.0000   105  0x12001b908  ldq     a1, 112(sp)
      0  0.0000   105  0x12001b90c  zapnot  t6, 0xf, t6
...

The conditional branches are associated with the statement:

udiv_qrnnd (q, r1, nx, workaround, dx);

in the source file
gmp-3.1.1/mpn/sb_divrem_mn.c
in the
GNU MP library. The code results from the expansion of a rather large
macro, udiv_qrnnd
, which is defined in the source file
gmp-3.1.1/longlong.h
. This required a little bit of
detective work to find and relate the assembler code back to the macro.
The branch at address 0x12001b8dc has a poor prediction rate (about 56%.)
Examination of the macro udiv_qrnnd
relates this conditional
branch to the following statements:
#define __udiv_qrnnd_c(q, r, n1, n0, d) \
  do { \
    UWtype __d1, __d0, __q1, __q0, __r1, __r0, __m; \
    ...
    if (__r1 < __m) \
      { \
        __q1--, __r1 += (d); \
        if ((__r1 >= (d)) && (__r1 < __m)) \
          __q1--, __r1 += (d); \
      } \
    __r1 -= __m; \
    ...

The conditional branch at 0x12001b8dc is generated from the first
if
test and the conditional branch at 0x12001b8f4 is
generated from the second if
test. Notice that in the
second case, the compiler has evaluated the &&
as a
bitwise AND operation instead of emitting a branch to implement the
logical AND operation. (If optimization is disabled, the compiler
will emit another branch.) C programmers can use this
optimization explicitly to remove unnecessary branches, for example,
by replacing the conditional expression:
((__r1 >= (d)) && (__r1 < __m))

with the bitwise logical expression:

((__r1 >= (d)) & (__r1 < __m))

Programmers must be careful to preserve correctness, however, as some conditional expressions depend upon an "early out," as in the case of testing for a NULL pointer before using the pointer elsewhere in the same conditional expression.
The macro code suggests that one or both branches could be eliminated by encouraging the compiler to generate conditional move instructions instead of conditional branches. An Alpha conditional move instruction such as:
cmoveq RA, RB, RC

transfers the contents of register
RB
to RC
if the register RA
is equal to zero. We rewrote the macro
code in the following way:
#define __udiv_qrnnd_c(q, r, n1, n0, d) \
  do { \
    UWtype __d1, __d0, __q1, __q0, __r1, __r0, __m, __t1; int __p1, __p2;\
    ...
    __p1 = (__r1 < __m); \
    __q1 -= __p1; \
    __t1 = __r1 + (d); \
    __r1 = (__p1 ? __t1 : __r1); \
    __p2 = __p1 & (__r1 >= (d)) & (__r1 < __m); \
    __q1 -= __p2; \
    __t1 = __r1 + (d); \
    __r1 = (__p2 ? __t1 : __r1); \
    __r1 -= __m; \
    ...

The rewrite adds three new variables: the temporary
__t1
and the two predicate flags __p1
and __p2
. The
new code uses the flags to conditionally increment the variable
__q1
and to guard the transfer of new values into the
variable __r1
. The compiler should generate conditional
move instructions to implement the C language conditional assignments.
The flag __p1
will be the value of the first if
test. The flag __p2
is the logical AND of __p1
and the value of the second if
test. Combining
conditions eliminates both if
tests, resulting
in "straight line" code without branches.
The rewritten code was compiled, executed and measured with DCPI.
The predicate-based macro code compiles into the dcpilist
excerpt below:
retired
 :count  cycles  freq
...
   2976    6424   107  udiv_qrnnd (q, r1, nx, workaround, dx);
...
      0     103   106  0x12001b918  mulq    v0, a1, t2
      0     100   106  0x12001b91c  mulq    v0, s4, t3
    107     122   106  0x12001b920  srl     t5, 0x20, t7
      0      83   106  0x12001b924  zapnot  t5, 0xf, t6
      0     103   106  0x12001b928  subq    s2, t2, t2
      0     124   106  0x12001b92c  mulq    t2, t4, t2
    108      91   106  0x12001b930  bis     t2, t7, t2
      0     130   106  0x12001b934  cmpult  t2, t3, a4
      0     102   106  0x12001b938  addq    t2, s3, a5
      0     238   106  0x12001b93c  cmoveq  a4, t2, a5
    191     117   106  0x12001b940  subq    v0, a4, t1
      0      86   106  0x12001b944  cmpule  s3, a5, t8
      0      95   106  0x12001b948  cmpult  a5, t3, t9
      0     107   106  0x12001b94c  addq    a5, s3, t10
    156      99   106  0x12001b950  and     a4, t8, t8
      0      91   106  0x12001b954  and     t8, t9, t8
      0     216   106  0x12001b958  cmoveq  t8, a5, t10
      0     120   106  0x12001b95c  subq    t1, t8, t1

The conditional branch instructions have been removed and the overall cycle count for this region of code is lower.
Unfortunately, in this case, the conditional move approach does
not provide the gain that we had hoped for. The table below compares
cycle and retire samples for the two versions of ecm4c
as a whole.
Version | Cycles | Retired |
Baseline | 193,926 | 427,465 |
Predicate | 192,339 | 428,635 |
The predicate version retires more instructions because the processor
maps each conditional move into two internal instructions and thus, each
conditional move appears to be counted twice. Although the number of
cycle samples is slightly lower for the predicate version, it is not
enough to substantially improve execution time. In fact, any improvement
in time is covered by the measurement error and no discernible
improvement in execution time was noted. This is not to say that the
technique is ineffective, but simply that it did not yield an improvement
in this particular case. The overall share of cycles attributed to
__gmpn_sb_divrem_mn
decreased by only a small amount,
from roughly 6.09% to 5.24%.
Conditional move instructions can also be manually inserted into
the source code using the asm
built-in function.
Source modules must include /usr/include/c_asm.h in order to use
in order to use
asm
function calls. The include file defines asm
as an intrinsic so that the compiler will perform special processing
on any call to asm
.
The first parameter to asm
is a sequence of Alpha
instructions. The instructions follow the syntax illustrated in
c_asm.h
. Further documentation can be found in the
Tru64 UNIX Programmer's Guide. Register usage follows
the Alpha call convention: registers %a0
through
%a5
contain incoming arguments and
register %v0
contains the return value. The function
asm
returns an integer value. Single
and double precision versions, called fasm
and dasm
, are also available.
Here is a version of the macro __udiv_qrnnd_c
that uses asm
to insert conditional moves into
the code stream:
__p1 = (__r1 < __m); \
__q1 -= __p1; \
__t1 = __r1 + (d); \
__r1 = asm("bis %r31, %a0, %v0;" \
           "cmovne %a1, %a2, %v0;", __r1, __p1, __t1); \
__p2 = __p1 & (__r1 >= (d)) & (__r1 < __m); \
__q1 -= __p2; \
__t1 = __r1 + (d); \
__r1 = asm("bis %r31, %a0, %v0;" \
           "cmovne %a1, %a2, %v0;", __r1, __p2, __t1); \
__r1 -= __m; \

The assembler instructions replace the C language conditional assignment statements that we used in the predicated version of the macro. The
bis
instructions above are actually
move operations that initialize the register %v0
to the old value of __r1
. The cmovne
instructions test the predicate flag, and if the flag is non-zero,
move the value of __t1
to the
result register %v0
.
If optimizations are enabled, the compiler will expand the
asm
calls in-line and will merge the instructions
into the generated code stream. This means that registers
other than %a0
, %a1
, etc. will
convey arguments and the result. The substitutions result in
tighter, faster code. If optimizations are disabled, possibly through
the use of the -g
debug switch, then code to
set up %a0
, %a1
, etc. will be
generated, introducing additional runtime overhead. The extra
overhead may then nullify any expected gain in performance.
In this section, we will demonstrate a quick way to find instructions with high average retire delay. High average retire delay may indicate the presence of a long latency load operation.
The fastest way to identify the instructions with the largest average
retire delay is to use the dcpitopcounts
command. The
following command:
dcpitopcounts -n 20 -pm valid:retdelay::valid ecm4c.test

will display the top 20 instructions in
ecm4c.test
with
the highest average retire delay. The command produces:
  retired
:retdelay
       ::
  retired
   :count  procedure                 pc           instruction
    15.00  lucas_cost                120007030    addt   $f0,$f1,$f0
    14.00  prac                      120007348    addt   $f0,$f10,$f0
    12.00  prac                      120007350    stt    $f0, 88(sp)
     9.00  prac                      12000763c    br     zero, 0x12000791c
     9.00  __gmpn_gcdext             120014014    bne    t9, 0x120014128
     7.80  __gmpz_set                12000fd74    cmplt  t0, a1, t0
     7.17  __gmpz_tdiv_r             12001567c    cvtqt  $f1,$f1
     7.00  lucas_cost                1200070d0    srl    t1, 0x21, t1
     7.00  lucas_cost                120007178    srl    t1, 0x21, t1
     7.00  prac                      12000753c    srl    t4, 0x22, t4
     7.00  __gmpn_sb_divrem_mn       12001b934    bis    t2, t6, t2
     6.57  __gmpn_sb_divrem_mn       12001b8d4    bis    t2, t7, t2
     6.20  __gmpn_sb_divrem_mn       12001b8cc    subq   s2, t2, t2
     6.01  __gmpn_sb_divrem_mn       12001b92c    subq   t2, t9, t2
     6.00  __gmpz_add                12000c6b0    bsr    ra, 0x12000c5a0
     6.00  __gmpn_divmod_1_internal  120011c28    ldbu   a3, 0(a3)
     6.00  __gmpn_gcdext             12001497c    stq    t1, 160(s6)
     6.00  __gmpn_tdiv_qr            120019f58    ldq    t4, -8(s3)
     5.67  __gmpn_gcdext             120014cd0    subq   s0, a4, s1
     5.54  __gmpn_mul_1              120015df0    stq    t2, 0(a0)

We should note that none of these instructions has a large enough average retire delay to suggest a problem with data cache behavior. (Twenty cycles is a reasonable threshold for an Alpha 21264A-based XP-1000 workstation.) However, we will investigate and discuss possible causes for higher average retire delay.
The next step in the analytical process is to examine the context of
each instruction to see if it consumes a value produced by a load
instruction. This can be accomplished by
using dcpilist
to disassemble
and display each instruction, using either the more
command
or a text editor to browse through the disassembled code. Three candidate
instructions are in the procedure lucas_cost
,
which can be disassembled and annotated with the following command:
    dcpilist -event cycles -pm valid:retdelay::valid lucas_cost ecm4c.test > lucas

The first instruction is at address 0x120007030:

    lucas_cost:
    cycles   valid:retdelay
    :count   ::valid          freq
    ...
    0         4.6667          6     0x12000702c  divt $f0,$f17,$f0
    7        15.0000          6     0x120007030  addt $f0,$f1,$f0

It is a floating point add instruction that consumes the value produced by the double precision divide instruction that immediately precedes it in program order. Floating point divide is a long latency operation (12 cycles single precision, 15 cycles double precision), so it's not too surprising that the consumer was delayed by 15 cycles.
The next instruction with high average retire delay in lucas_cost is at address 0x1200070d0:

    0   0.9091  10  0x1200070c4  mulq   t0, t1, t1
    0   0.0000  10  0x1200070c8  zapnot t3, 0xf, t3
    0   1.0625  10  0x1200070cc  mulq   t0, t3, t3
    9   7.0000  10  0x1200070d0  srl    t1, 0x21, t1

This instruction is a left shift that consumes the output of the integer multiply instruction that precedes it at address 0x1200070c4. Integer multiply has a result latency of 7 cycles. It is coincidental that the average retire delay equals the latency of integer multiply; some of the latency could have been "hidden" by the instructions between the first multiply and the consuming shift instruction -- instructions that could have been executed while the first multiply was in progress.

The third instruction in lucas_cost with high average retire delay is at address 0x120007178:

    0   NA      1   0x120007174  mulq t0, t1, t1
    0   7.0000  1   0x120007178  srl  t1, 0x21, t1

and it also consumes the output of an integer multiply instruction.
Most of the remaining instructions in the list can be eliminated for one of two reasons:

- The instruction had only 1 or 2 retire samples in the dcpitopcounts list -- not enough to yield a statistically significant result.
The instruction at address 0x12000fd74 in the procedure
__gmpz_set
could potentially be the victim of a
long latency load. It consumes the result of a load instruction:
    __gmpz_set (set.c):
    retired  valid:retdelay
    :count   ::valid          freq
    ...
    2         2.0000          3     0x12000fd44  ldl   t0, 0(a0)
    ...
    5         7.8000          3     0x12000fd74  cmplt t0, a1, t0
    ...

We found the load instruction by tracing back from the instruction at 0x12000fd74 to find the instructions that produced the register operands t0 and a1. The trace-back of t0 ended with the load instruction at 0x12000fd44.
Unfortunately, neither instruction was executed enough times to
give much statistical confidence in the result.
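The trace-back described above can be mechanized. The following is a hypothetical sketch, not a DCPI feature: the listing format and the destination-register rule are simplifying assumptions, and it only works within straight-line code (the filler instruction at 0x12000fd48 is made up).

```python
# Hypothetical sketch: scan backwards through a linear disassembly listing
# for the instruction that produced a given register operand.

def dest_register(instr):
    """Return the register an instruction writes. Alpha loads (ld*) name
    their destination first; operate instructions name it last."""
    mnemonic, operands = instr.split(None, 1)
    parts = [p.strip() for p in operands.split(",")]
    return parts[0] if mnemonic.startswith("ld") else parts[-1]

def trace_back(listing, start_index, reg):
    """Scan backwards from start_index for the producer of reg."""
    for i in range(start_index - 1, -1, -1):
        if dest_register(listing[i][1]) == reg:
            return listing[i]
    return None

# Load and consumer taken from the __gmpz_set example; middle entry made up.
listing = [
    ("0x12000fd44", "ldl t0, 0(a0)"),
    ("0x12000fd48", "bis zero, zero, v0"),   # hypothetical filler
    ("0x12000fd74", "cmplt t0, a1, t0"),
]
print(trace_back(listing, 2, "t0"))  # the ldl at 0x12000fd44 produced t0
```
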
The procedure __gmpn_addmul_1
consumed the most cycles
and was clearly the hottest procedure in the ecm4c.test
program. It consists of assembler code
that was carefully tuned for maximum performance on the Alpha 21264.
The hand-tuned assembler code achieves its speed through loop unrolling
and software pipelining. The programmer carefully controlled the
assignment of instructions to functional units in order to keep them
as busy as possible and to minimize instruction stalls.
Its retire/cycle ratio is 2.64, which is near the "theoretical" target
of 3.3 retires per cycle for integer code (assuming that load operations
hit in the L1 D-cache.) We performed some experiments to improve the
code schedule and simply were not able to improve performance through
instruction scheduling.
The procedure __gmpn_addmul_1
has eight code regions
as summarized in the table below.
Region | "Temperature" | Size |
A | Hot | 2 |
B | Warm | 48 |
C | Hot | 12 |
D | Warm | 7 |
E | Cold | 42 |
F | Hot | 50 |
G | Cold | 115 |
H | Hot | 57 |
A region is a sequence of related instructions and may consist of one or more basic blocks. Region "temperature" indicates the relative execution frequency of a region. Hot regions are executed frequently, warm regions are executed infrequently, and cold regions are not executed at all.
A performance increase may sometimes be obtained through "hot-cold" procedure optimization, routine splitting and chaining. These optimizations are performed by the tool Spike. Spike rearranges procedures to place frequently called procedures close to each other in memory and rearranges code to put hot regions together. Control flow between regions is changed to permit straight-line flow from one hot region to the next, and cold regions are moved out of line. Entry points to basic blocks are padded with NOP instructions to increase fetch efficiency. Overall, these changes improve the L1 I-cache performance and instruction fetch behavior of the program.
The table below shows the order of the regions in
__gmpn_addmul_1
after Spike'ing ecm4c.test
.
The hot regions have been pulled together at the top of
the procedure followed by the warm regions. The cold regions have
been moved out of line to a separate area in memory.
Region | "Temperature" |
A | Hot |
C | Hot |
F | Hot |
H | Hot |
B | Warm |
D | Warm |
... | ... |
E | Cold |
G | Cold |
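The layout transformation can be sketched in a few lines. This is a deliberate simplification of what Spike does (real Spike also rewrites control flow and pads entry points with NOPs; none of that is modeled here); the region names and temperatures come from the tables above.

```python
# Simplified sketch of hot-cold code layout: place hot regions first, then
# warm regions, and move cold regions out of line, as in __gmpn_addmul_1.

regions = [
    ("A", "Hot"), ("B", "Warm"), ("C", "Hot"), ("D", "Warm"),
    ("E", "Cold"), ("F", "Hot"), ("G", "Cold"), ("H", "Hot"),
]

rank = {"Hot": 0, "Warm": 1, "Cold": 2}
# A stable sort keeps the original program order within each temperature class.
layout = sorted(regions, key=lambda r: rank[r[1]])
print([name for name, _ in layout])
# -> ['A', 'C', 'F', 'H', 'B', 'D', 'E', 'G']
```

The stable sort is the essential design choice: within each temperature class, the regions keep their original program order, matching the Spike'd layout shown above.
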
Using Spike reduced execution time to 41.3 seconds from 41.8 seconds,
which was the execution time for the baseline version
of ecm4c.test
.
-event cycles
The number of processor cycles is a measure of time. At the system level, it reflects the total number of processor cycles on a system-wide basis that were needed to complete a task. System level measurements may include cycles that were spent on non-relevant tasks such as unrelated transient interrupts from the network, the user interface, etc. Thus, it is important to take measurements on a quiet system with as little non-relevant activity as possible.
At the image and procedure levels, cycles indicate the amount of time spent executing code within an image or procedure, respectively. Images or procedures that take the most cycles to execute are the most lucrative targets for improvement. Executable components such as shared libraries, PALcode and the operating system may contribute to the total number of relevant cycles, just as the application program components themselves do.
Cycles at the instruction level indicate the amount of time attributed to the execution of individual instructions. On out-of-order execution machines like the Alpha 21264 or 21364, cycles cannot be attributed precisely and cycle counts indicate the neighborhood of a hot spot. Cycle samples are often attributed to the first instruction in a fetch block (also known as a 16 byte "quad pack.")
Summarizing, if you want to measure all of the cycles
spent running a workload, use cycle samples.
On Alpha 21264 and other out-of-order execution
machines, time that should be charged to one instruction may be
charged as cycle samples to an instruction that is up to 80 instructions
later in program order. Cycle samples are fine for dcpiprof
summaries, but may be confusing at the instruction level.
-pm retired
An instruction is "retired" when the instruction and all preceding instructions (in program order) are past the point at which exceptions and mispredicts are detected. Retirement commits the results of the instruction to the architectural program state and frees physical registers that are no longer needed. The Alpha 21264 can retire up to eleven instructions in a single cycle. Instructions are "in flight" from the time they are mapped to the time that they are retired.
Because a retired instruction advances the state of program execution, it is, in a sense, a "useful" instruction. It has computed a value that directly contributes to the intended result. Useful computational work has been performed and committed when an instruction retires.
At the system, image, procedure and instruction levels, the number of retired instructions indicates the proportion of successfully executed instructions for a given component. Since a retired instruction is "useful" in the sense defined above, it can be used in ratios to compare the amount of useful work completed versus the number of events that degrade performance.
The following special cases apply to instruction counts.
-event cycles -pm retired
The cycle to retired instruction and retired instruction to cycle ratios are clearly related to each other (they're reciprocals of one another.)
The cycle to retired instruction ratio is the number of cycles required to complete and commit a retired instruction. Ideally, this ratio is as small as possible since less time is required, on average, to execute a retired instruction. A ratio of less than one indicates concurrency since an instruction is effectively being completed in less than a full processor cycle.
The retired instruction to cycle ratio is the number of retired instructions that are produced in a single processor cycle. A ratio that is greater than one indicates concurrency.
The Alpha 21264 is able to retire up to eleven instructions in a single cycle. It can sustain a rate of eight retires per cycle for short periods of time. This peak figure is further limited in real code by the availability (or shortage) of physical integer and floating point registers. In integer code with D-cache references, a maximum of 3.3 retires per cycle could be achieved, but is quite unlikely in practice. In floating point code with D-cache references, the maximum is 2.3 floating point instructions per cycle, again, a maximum that is unlikely to be achieved in real code. Thus, a retire/cycle ratio of 3 in integer code and a retire/cycle ratio of 2 in floating point code are exceptionally good. If the ratio is in the neighborhood of one retire per cycle or less, then further improvement is possible and investigation is warranted.
The sampling period for cycle and retired instruction events must be equal, or the sample counts must be normalized to account for the different sampling periods. This requirement must be satisfied for all composite performance measures (e.g., ratios.) The sampling "weight" of one event in a composite measure must be the same as the "weight" of any other event in the measure.
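The normalization requirement can be made concrete. In this hypothetical sketch, raw sample counts taken with different sampling periods are scaled to a common weight before forming a retire/cycle ratio; the periods and counts are made-up numbers for illustration.

```python
# Hypothetical sketch: normalize sample counts taken with different
# sampling periods before forming a composite measure. A count taken with
# period P represents roughly count * P actual events, so scaling by the
# period puts both events on the same "weight".

def normalized(count, period):
    return count * period

cycle_samples,   cycle_period   = 120_000, 63_488    # made-up numbers
retired_samples, retired_period = 90_000,  126_976   # period is 2x cycles'

cycles  = normalized(cycle_samples, cycle_period)
retired = normalized(retired_samples, retired_period)

retires_per_cycle = retired / cycles
print(f"{retires_per_cycle:.2f} retires/cycle")
```

Forming the ratio from the raw counts (90,000 / 120,000 = 0.75) would understate the true ratio by a factor of two, because each retired-instruction sample here stands for twice as many events as each cycle sample.
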
-pm retired::/retired
Instructions that did not retire are aborted instructions. Aborted instructions did not complete and did not advance the program state. Instructions may not complete for a variety of reasons such as mispredicted control flow and memory system replay traps. Instructions caught in the shadow of a trap will not complete and are aborted.
Aborted instructions consume pipeline resources. Depending on how
far an aborted instruction has progressed, it may consume an
instruction number (Inum), a physical register for its result,
an instruction queue slot, and a functional unit.
An instruction may be aborted (killed) early in the pipeline, before
it enters an issue queue. An instruction may be aborted (killed) late
in the pipeline, after it enters an issue queue. The early_kill
and late_kill
ProfileMe event types can be used to measure
the number of early and late kills, respectively. Instructions that are
killed late in the processor pipeline are more costly than early kills.
The ratio of retired to non-retired (aborted) instructions indicates the relative number of useful, completed instructions to wasted, discarded instructions. Naturally, a high ratio is desired because that indicates efficient use of pipeline resources.
-pm retired::trap
The pipeline traps when the hardware has detected a condition that must be corrected. Two common kinds of traps are control flow mispredictions and memory system replay traps. (These will be explained a little later.) The processor must restore its internal state to a point from which recovery can be performed. This often means aborting certain inflight instructions, thereby discarding intermediate results.
The ratio of retired instructions to the number of traps indicates the relative frequency of instruction traps. Since traps are wasteful, a high ratio is desirable (i.e., relatively infrequent pipeline traps.)
A pipeline trap is not the same as a software trap. Pipeline traps correct internal processor conditions to maintain architecturally correct execution of a program. Software traps yield execution to the operating system via a much different mechanism.
-pm retired::replays
Memory system replay traps respond to dynamic conditions that
are detected in the memory system. Certain conditions cannot be
detected statically by analyzing the program stream. Sometimes,
these conditions depend upon the effective address computed by
a load or store instruction. For example, the hardware cannot
detect a producer-consumer dependency between a store instruction
and a consuming load until it evaluates both effective addresses
and realizes that they are the same address. Due to aggressive
instruction scheduling, the processor may need to change the
order of execution of the load and store to preserve correctness.
This kind of replay trap has a special name, the load-store order replay trap, and is reported by the DCPI/ProfileMe event name ldstorder.
The ratio of retired instructions to replay traps is an indicator of the relative frequency of replay traps. Since replay traps result in wasted (aborted) instructions, they should be avoided.
Be careful to exclude the operating system idle loop from system level analysis. The idle loop gets good ILP and doesn't trap much.
-pm retired::mispredict
The Alpha 21264 pipeline needs a steady incoming flow of instructions to keep its functional units and memory subsystem busy. In regions of "straight-line" code with no branches, the processor can freely fetch ahead as fast as possible to keep the flow of instructions coming. When the processor encounters a conditional branch or JSR group instruction (jump, jump to subroutine, return from subroutine, and jump to coroutine) it must make a decision as to where to continue fetching instructions. The processor uses recent program execution history to predict the location of the target and to continue fetching from there. Instructions beyond the conditional branch or JSR group instruction are fetched and enter the back end of the pipeline beginning execution.
If the processor predicts the conditional branch or jump correctly, the instructions will execute, complete and retire. If the processor predicts incorrectly, the instructions beyond the conditional branch or JSR group instruction must be discarded (aborted.) Clearly, accurate prediction is important to good performance.
At the system and image levels, the ratio of retired instructions to mispredicted instructions is a rough indicator of prediction accuracy. The mispredict and retire data at the system and image levels is aggregated across all conditional branches and jumps in the program. Exceptionally good prediction accuracy in some code/loops may mask the presence of performance degrading mispredicts in other code/loops. Thus, at the system and image levels, this ratio is not a perfect indicator, but is still somewhat useful. The ratio performs better at the procedure and instruction levels, especially if each procedure is small and is not subject to masking due to aggregation of sample data.
At the instruction level, the DCPI/ProfileMe mispredict
event is
valid for conditional branch, JSR, RET, JMP, and JSR_COROUTINE
instructions only. The ratio is the misprediction rate for the
instruction. The difference,

    1 - (mispredict / retired),

is the prediction rate or prediction accuracy for the instruction.
High prediction accuracy is needed to provide a "window" of useful instructions for execution. For a given prediction accuracy, p, and basic block size, B, the effective "window" size when fetching through two conditional branches is:

    B + p·B + p²·B

Basic block size is typically 5 to 6 instructions. The Alpha 21264 may have up to 80 instructions in flight at once. In order to achieve this, prediction accuracy must be very high in code with many conditional branches and short basic blocks. For example, 99%+ accuracy is required when fetching through 10 branches with a 6 instruction basic block size. If the basic block size doubles to 12 instructions, less accuracy is required (93%.)
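The two-branch formula extends naturally to fetching through n branches. The generalization below is our extrapolation of the formula in the text, shown as a short sketch:

```python
# Effective fetch "window" when speculating through n conditional branches,
# generalizing the two-branch formula B + p*B + p^2*B:
#   window(n) = B * (1 + p + p^2 + ... + p^n)

def window_size(p, block_size, branches):
    return block_size * sum(p ** i for i in range(branches + 1))

# Two branches, perfect prediction, 6-instruction blocks: 3 blocks visible.
print(window_size(1.0, 6, 2))            # 18 instructions

# With 12-instruction basic blocks, 93% accuracy sustains a window larger
# than 80 instructions through 10 branches ...
print(round(window_size(0.93, 12, 10)))  # ~94
# ... but with 6-instruction blocks the same accuracy falls well short.
print(round(window_size(0.93, 6, 10)))   # ~47
```
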
-pm cbrmispredict::retired
The DCPI/ProfileMe event cbrmispredict
applies to
conditional branch instructions. The ratio of conditional branch
mispredicts to retired instructions is the conditional branch
misprediction rate. An aggregate conditional branch
mispredict rate can be computed at the system, image and procedure
levels. Again, exceptionally good performance in some loops/code
may mask problems in other loops/code due to aggregation.
At the instruction level, the ratio is the misprediction rate for individual conditional branch instructions. The Alpha 21264 branch prediction algorithm is table-driven. Conditional branch mispredicts are due to:
-pm /cbrmispredict^mispredict::retired
As noted above, the DCPI/ProfileMe mispredict
event
is valid for both conditional branch instructions and JSR group
instructions. Unlike conditional branch misprediction events, a
separate event is not provided for JSR group misprediction events.
Instead, the subset of JSR group misprediction events can be
derived from all mispredictions using the
event expression:
/cbrmispredict^mispredict
This expression selects the sample data for mispredict events (mispredict) which are not conditional branch mispredict events (/cbrmispredict). The sample count for the subset is then divided by the retire count, obtaining the rate in the usual way.
Interpretation of the JSR group misprediction rate is analogous to the interpretation of the conditional branch misprediction rate.
-pm retired::dtbmiss
-pm retired::itbmiss
A translation look-aside buffer (TLB) is a small memory that caches recently used virtual to physical address mapping information. A TLB is a table where each entry describes the mapping of a virtual memory page to a physical memory page. (The entry can handle multiple contiguous pages if granularity hints are used.)
The Alpha 21264 has two TLBs: an Instruction Translation Buffer (ITB) and a Data Translation Buffer (DTB.) The ITB handles instruction fetches and the DTB handles data reads and writes. The ITB and DTB each have 128 entries. If the page size is 8,192 bytes, then each Alpha TLB has a 1Mbyte reach, that is, each TLB provides a 1Mbyte "window" into the virtual memory space.
When information for a virtual address cannot be found in the appropriate TLB, then one of the entries must be updated with information for the new target page. The Alpha PALcode is responsible for handling the TLB miss and for updating the TLB with the new page information.
The ratio of retired instructions to ITB misses and the ratio of retired instructions to DTB misses indicate the relative frequency of ITB and DTB misses, respectively. Since ITB and DTB miss handling is relatively expensive, higher ratios (infrequent ITB and DTB misses) are desirable.
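The 1 Mbyte "reach" figure follows directly from the TLB geometry described above; a quick check using the entry count and page size from the text:

```python
# TLB reach = number of entries * page size. (Granularity hints, which let
# one entry map multiple contiguous pages, are ignored here.)
entries   = 128    # Alpha 21264 ITB/DTB entries
page_size = 8192   # bytes per page

reach = entries * page_size
print(reach // (1024 * 1024), "Mbyte")  # 1 Mbyte
```
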
When computing DTB and ITB miss ratios at the system level, do not include the number of retired instructions attributed to the ITB/DTB miss handling routines in the PALcode. The number of retired instructions in the miss handling routines should be subtracted from the total number of retired instructions as miss handling instructions are computational overhead. Failure to do so will include extra retired instructions in the numerator of the ratio and will make the situation appear to be better than it is.
-pm retired::nyp
The Alpha 21264A does not provide a ProfileMe event for L1 I-cache
and D-cache misses. The DCPI/ProfileMe event nyp
is
asserted when a profiled instruction is contained in an aligned
4-instruction I-cache fetch block that requested a new I-cache fill
stream. Thus, "nyp
" means "not yet prefetched."
The counters are blind to many I-cache misses; the nyp
bit is
not set if the prefetcher asked for the instruction a mere cycle before
the fetcher needed it, for example. Therefore, nyp
undercounts I-cache miss events and is a lower bound on the
number of I-cache misses.
The ratio of the number of retired instructions per instructions not yet prefetched is a rough indicator of the frequency of L1 I-cache misses. The reciprocal ratio:

    nyp / retired

is a rough approximation of the I-cache miss rate. A low I-cache miss rate is a goal for good performance.
-pm valid:retdelay::valid
An instruction's retire delay is a lower bound on the number of CPU cycles that the instruction has delayed its own retirement and the retirement of other instructions following it in program order. High retire delay may indicate the presence of a performance bottleneck.
Ideally, the total, system-level retire delay would equal the total number of cycles, since retire delay would charge all spent cycles to instructions. Unfortunately, the Alpha 21264A has "blind spots" in its performance counters and total retire delay is less than total cycles. (During a blind spot, the hardware event counter stops incrementing and does not count events.) On Alpha 21264A, retire delay has blind spots that effectively exclude most of the cycles spent aborting instructions and refilling the pipeline after a pipeline trap. Retire delay, for example, is blind to I-cache miss stalls. Retire delay (retdelay) is not blind to load misses, however, making retire delay a useful tool in assessing load latency.
The gap between cycles and retire delay is given by the formula:

    cycles - retire delay

This difference is a good estimate of the cycles spent aborting instructions and refilling the pipeline after pipeline traps.
Retire delay is especially valuable for finding high latency load operations. The ratio:

    retire delay / number of valid instructions

gives the average retire delay. The ratio can be used at the instruction level to identify load-miss opportunities (i.e., long latency load operations.) To find load-miss opportunities, start with the command:

    dcpitopcounts -pm valid:retdelay+valid:retdelay::valid

This command will sort instructions by total retire delay; the output can then be filtered to keep only the instructions whose average retire delay exceeds a chosen threshold. Next, examine the candidates and find instructions whose register operands are produced by a preceding load instruction (a load-use pair.) That will identify instructions that are:
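The filtering step can be scripted. This hypothetical sketch applies an average-retire-delay threshold to rows shaped like dcpitopcounts output; the row format, the made-up third row, and the 20-cycle threshold (the XP-1000 rule of thumb discussed earlier) are our assumptions.

```python
# Hypothetical sketch: filter dcpitopcounts-style rows by average retire
# delay. Each row: (avg_retire_delay, procedure, pc, instruction). The
# 20-cycle threshold suits an Alpha 21264A-based XP-1000; tune per machine.

THRESHOLD = 20.0

rows = [
    (15.00, "lucas_cost",   "120007030", "addt $f0,$f1,$f0"),
    (14.00, "prac",         "120007348", "addt $f0,$f10,$f0"),
    (27.50, "hypothetical", "120009999", "addq t0, t1, t0"),  # made-up row
]

candidates = [r for r in rows if r[0] > THRESHOLD]
for delay, proc, pc, instr in candidates:
    print(f"{delay:6.2f}  {proc:<14} {pc}  {instr}")
```

Only the made-up third row survives the filter; the real rows from the ecm4c.test run all fall under the 20-cycle threshold, consistent with the earlier observation that this program shows no data-cache problem.
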