Performance tuning is the process of identifying and removing bottlenecks. Bottlenecks can arise from many different sources. Here are but a few examples.
DCPI/ProfileMe runs in the background and collects performance information about a program as it executes. In fact, DCPI/ProfileMe collects information about all running processes, giving the programmer a system-wide view. This broad view assists the analysis of applications that may be distributed across two or more processes.
Here is a short example of DCPI/ProfileMe in action. It shows
how to run DCPI/ProfileMe and the kind of information produced by
its analysis. We'll run and analyze the following program,
which computes the value of e raised to the x
power for x between 0.0 and 10.0. This is not a particularly
good implementation of the Taylor series approximation, but it
should burn about one minute's worth of execution time. (The
execution platform is a 667MHz XP-1000 workstation with 256Mbytes of
primary memory and a 4Mbyte level 2 cache, running the Tru64 UNIX
operating system.) The function main
consists of a loop that calls e_to_the_x
with
successive values of x. The function e_to_the_x
evaluates a Taylor series polynomial with 40 terms.
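In other words, the program evaluates the truncated Taylor series:

```latex
e^{x} \approx \sum_{i=0}^{40} \frac{x^{i}}{i!}
      = 1 + x + \frac{x^{2}}{2!} + \cdots + \frac{x^{40}}{40!}
```

Each iteration of the inner loop builds the next term x^i/i! incrementally from the previous one, which is why the loop body carries a multiply, a divide and an add.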
#include <stdio.h>

#define NTERMS 40
#define NITERATIONS 100000000

double e_to_the_x(double x)
{
    long int i ;
    double factorial, term, result ;

    result = 1.0 ;
    factorial = 1.0 ;
    term = 1.0 ;
    for (i = 1 ; i <= NTERMS ; i++) {
        term = term * x ;
        factorial = factorial * (double) i ;
        result = result + (term / factorial) ;
    }
    return( result ) ;
}

main(int argc, char *argv[])
{
    int i ;
    double ex, x, step ;

    x = 0.0 ;
    step = 10.0 / ((double) NITERATIONS) ;
    for (i = 0 ; i <= NITERATIONS ; i++) {
        ex = e_to_the_x(x) ;
        if ((i == 0) || (i == NITERATIONS)) {
            printf("e^%f: %f\n", x, ex) ;
        }
        x = x + step ;
    }
}
The session transcription sets the environment variable
DCPIDB
to the path to the DCPI measurement
database, starts the DCPI/ProfileMe collection system,
runs the program (taylor
), and
stops data collection.
1 > setenv DCPIDB /dsk0h/dcpidb
2 > dcpid -event pm $DCPIDB
dcpid: load : loaded pcount driver
dcpid: monitoring pm
dcpid: logging to /dsk0h/dcpidb/dcpid-positron.log
3 > taylor
e^0.000000: 1.000000
e^10.000000: 22026.465709
4 > dcpiquit
dcpictl: quit successful
Collecting data is just that easy. Data collection adds about 2 to 3 percent overhead to program execution. It does not require recompilation/relinking, nor does it invade the program in any way.
DCPI/ProfileMe provides several tools for analyzing and displaying
performance information. This paper concentrates on the use of
two of these tools, dcpiprof
and dcpilist
.
The basic analytical process involves drilling down from a system-wide summary, to binary images, to procedures, and finally to individual instructions. The dcpiprof transcription below shows an overview of the execution of taylor. DCPI/ProfileMe collects and
summarizes hardware events that occur during program execution. The
summary below shows the distribution of retired (completed) instructions
across the program images that were active during the measurement period.
The program image taylor
has the largest number of
retired instructions, indicating that it was the busiest component and
is a likely candidate for further investigation.
5 > dcpiprof -i -pm retired
Column               Total  Period (for events)
------               -----  ------
retired:count       815091   63488

The numbers given below are the number of samples for each listed event
type or, for the ratio of two event types, the ratio of the number of
samples for the two event types.
===========================================================
retired
 :count      %    cum%  image
 667207  81.86%  81.86%  taylor
 147151  18.05%  99.91%  /vmunix
    351   0.04%  99.95%  /dsk0h/dcpidb/PALcode
    178   0.02%  99.97%  /usr/bin/dcpid
     79   0.01%  99.98%  /usr/shlib/libXt.so
     49   0.01%  99.99%  /usr/shlib/libc.so
     18   0.00%  99.99%  /usr/shlib/libX11.so
     15   0.00%  99.99%  /usr/shlib/libpthread.so
     11   0.00% 100.00%  /usr/shlib/X11/libdix.so
      8   0.00% 100.00%  /usr/shlib/libXm.so
      5   0.00% 100.00%  unknown@positron
      4   0.00% 100.00%  /usr/shlib/libmach.so
      3   0.00% 100.00%  /usr/shlib/X11/libos.so
      2   0.00% 100.00%  /usr/shlib/X11/lib_dec_ws.so
      2   0.00% 100.00%  /usr/sbin/routed
      2   0.00% 100.00%  /usr/sbin/xntpd
      2   0.00% 100.00%  /sbin/loader
      1   0.00% 100.00%  /bin/sh
      1   0.00% 100.00%  /usr/dt/bin/dtterm
      1   0.00% 100.00%  /usr/shlib/X11/libmi.so
      1   0.00% 100.00%  /usr/bin/csh
The dcpiprof
tool is also used to produce a procedure-by-procedure
breakdown of DCPI/ProfileMe events for an image. The excerpt below
shows the breakdown of retired instructions for each of the functions
in taylor
. The function e_to_the_x
is the most
active procedure and is worth a closer look.
6 > dcpiprof -pm retired taylor
Column               Total  Period (for events)
------               -----  ------
retired:count       667207   63488

The numbers given below are the number of samples for each listed event
type or, for the ratio of two event types, the ratio of the number of
samples for the two event types.
===========================================================
retired
 :count      %    cum%  procedure   image
 646743  96.93%  96.93%  e_to_the_x  taylor
  20464   3.07% 100.00%  main        taylor
The dcpilist
program shows the instruction-by-instruction
breakdown of DCPI/ProfileMe events for a particular procedure.
dcpilist
displays the breakdown as an annotated
assembler (or source language) listing. The annotations show
the number of selected DCPI/ProfileMe events associated with
each instruction. The transcription below shows the number
of retire events and retire delay for each instruction in
e_to_the_x
. An instruction retires when it
successfully executes to completion in the CPU pipeline.
An instruction's retire delay is a lower bound on the number
of CPU cycles that the instruction has delayed its own
retirement and the retirement of other instructions
following it in program order. High retire delay may indicate
the presence of a performance bottleneck.
7 > dcpilist -pm retired+retired:retdelay e_to_the_x taylor
e_to_the_x:
 retired   retired
  :count :retdelay   freq
       0         0   1122 0x120001160  ldah    gp, 8192(t12)
       0         0   1122 0x120001164  ldq_u   zero, 0(sp)
       0         0   1122 0x120001168  lda     gp, 28448(gp)
       0         0   1122 0x12000116c  ldq_u   zero, 0(sp)
    1561         0   1122 0x120001170  ldah    at, -8192(gp)
    1604         0   1122 0x120001174  lda     sp, -16(sp)
    1661         5   1122 0x120001178  lds     $f1, -28888(at)
    1587         0   1122 0x12000117c  bis     zero, 0x1, v0
    1589         0   1122 0x120001180  cpys    $f1,$f1,$f0
    1567         0   1122 0x120001184  cpys    $f1,$f1,$f10
    1534         0   1122 0x120001188  ldq_u   zero, 0(sp)
    1526      1526   1122 0x12000118c  ldq_u   zero, 0(sp)
   63089         0  63103 0x120001190 : stq    v0, 0(sp)
   63134         0  63103 0x120001194  mult    $f1,$f16,$f1
   63248         0  63103 0x120001198  lda     v0, 1(v0)
   63105         1  63103 0x12000119c  cmple   v0, 0x28, t0
   62998         0  63103 0x1200011a0  ldt     $f11, 0(sp)
   62537         8  63103 0x1200011a4  cvtqt   $f11,$f11
   62995        30  63103 0x1200011a8  mult    $f10,$f11,$f10
   63413     63442  63103 0x1200011ac  divt    $f1,$f10,$f11
   63415    625262  63103 0x1200011b0  addt    $f0,$f11,$f0
   63097     63125  63103 0x1200011b4  bne     t0, 0x120001190
    1557         0   1122 0x1200011b8  lda     sp, 16(sp)
    1526         4   1122 0x1200011bc  ret     zero, (ra), 1
The relative distribution of event counts and counter values often indicates the presence of a performance bottleneck. In the example above, the high retire delay at address 0x1200011b0 is due to the relatively high latency of the floating point divide instruction (divt). The floating point add instruction (addt) must wait for the quotient in register f11.
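One way to attack a bottleneck of this kind (a sketch, not part of the DCPI tool set or the paper's code) is to hoist the divides out of the hot loop by precomputing the reciprocal factorials once. The loop then contains only multiplies and adds, which pipeline much better than divt. The names below (coeff, init_coeffs) are invented for illustration.

```c
#define NTERMS 40

/* Hypothetical rework of e_to_the_x: precompute the Taylor
   coefficients 1/i! once so the hot loop contains no divide. */
static double coeff[NTERMS + 1];

static void init_coeffs(void)
{
    double factorial = 1.0;
    int i;

    coeff[0] = 1.0;
    for (i = 1; i <= NTERMS; i++) {
        factorial = factorial * (double) i;
        coeff[i] = 1.0 / factorial;   /* the only divides, done once */
    }
}

static double e_to_the_x(double x)
{
    double term = 1.0, result = 1.0;
    int i;

    for (i = 1; i <= NTERMS; i++) {
        term = term * x;                   /* x^i built incrementally */
        result = result + term * coeff[i]; /* multiply-add, no divt  */
    }
    return result;
}
```

With a change along these lines, the divt/addt dependency seen in the listing above disappears from the loop body; only multiplies and adds remain on the critical path.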
This paper is for experienced C programmers who would like to tune their programs for execution on Alpha. FORTRAN programmers with a reading knowledge of C will also benefit. The paper concentrates on the ProfileMe features as supported by contemporary Alpha processors like the 21264.
Performance analysis with DCPI/ProfileMe is not for the casual user. A programmer must be willing to explore the relationships between source code and compiled code, and between the program and the behavior of the Alpha processor and its associated memory system. A reading knowledge of Alpha assembler language is helpful. A programmer should at least be able to identify load, store, computational and branch instructions, and to follow the flow of data between instructions via the Alpha integer and floating point registers.
If you're ready to forge ahead, here is the roadmap for the rest of the paper. Section 2 is a detailed introduction to the behavior of the Alpha 21264 processing pipeline. The information in Section 2 is needed to understand the meaning of certain DCPI/ProfileMe hardware events and may be skipped on a first reading. Section 3 describes the way DCPI/ProfileMe collects performance information. It also contains several tables that summarize the kinds of events and information that DCPI/ProfileMe can collect.
After these preliminaries, Section 4 discusses the "mechanics" of using DCPI/ProfileMe in more detail. Section 4 should get you running DCPI/ProfileMe and experimenting with programs of your own. Section 5 digs into the interpretation of DCPI/ProfileMe data for specific Alpha processor and memory system features. Section 5 covers broad performance indicators, branch prediction behavior, replay traps, memory system behavior and data alignment in memory. The section concludes with a brief description of relatively rare behavioral issues.
The Alpha 21264 processor achieves high performance by identifying and exploiting instruction level parallelism (ILP). ILP is identified by analyzing dependencies between instructions. ILP is exploited by removing false dependencies and issuing instructions for execution when they are data-ready. Instructions may issue and execute out of program order, although results are committed in program order to maintain program and architectural correctness. The 21264 provides several computational units (four integer and two floating point) that operate in parallel to perform computations concurrently when possible.
Let's illustrate this rather dense paragraph with a short example. The following assignment statement:
e ← (a + b) * (c + d)
may be compiled into the sequence of instructions:
addq $16, $17, $20   # Where register $16=a and $17=b
addq $18, $19, $21   # Where register $18=c and $19=d
mulq $20, $21, $22   # Putting the result, e, into register $22

If the values of a, b, c, and d are available
in their respective registers (that is, the data is ready), then the two
additions could be performed concurrently on two separate functional
units, assuming that the units are also available and ready. The
multiply depends upon the results produced by the two additions and
must be performed after the additions. The 21264 identifies the
opportunity to execute the two additions in parallel and assigns
the computational resources needed to do the job. It also recognizes
that the data dependencies between the add instructions and the
multiply instruction must be preserved and performs the multiply
after the addition instructions.
Instructions pass through several stages of execution as they find their way through the 21264 pipeline. Alpha operate (register-to-register), load, and store instructions go through five common stages (stages 0 through 4) as shown in the table below. Processing for these main instruction types diverges in subsequent stages (stage 5 and beyond).
Stage | Purpose |
0 | Instruction fetch |
1 | Instruction slotting |
2 | Register map |
3 | Instruction issue |
4 | Register read |
5 | Execution |
... | Retirement |
Here is a brief rundown on the common instruction stages. This is a simplified description and we will add details later when considering performance.
Operate instructions are integer and floating point instructions that perform register-based operations. To achieve high speed, each computational cluster (two integer and one floating point) has its own register file. The two integer register files must be kept up to date and consistent by exchanging results. Operate instructions pass through the following sequence of stages:
Due to resource constraints in the integer execution unit, up to two load instructions can be issued per clock cycle. Stages in load processing are:
Three components help implement access to L2 cache and primary memory:
Store instructions are executed in the following way.
Throughput depends upon a fast, smooth flow of instructions and data into, through and out of the processor pipeline. The 21264 incorporates features to increase and smooth this flow. For example, the 21264 fetches instructions aggressively, even fetching through conditional branches. When a conventional (more precisely, "non-speculative") processor encounters a conditional branch instruction, it waits for the final outcome of the branch condition, to see where the flow of control will go next, before fetching new instructions. The 21264, however, uses branch prediction to speculate on the control flow through conditional branches and continues to prefetch instructions based upon its prediction.
Resource | Capacity |
Integer instruction queue | 20 entries |
Floating point instruction queue | 15 entries |
Physical integer registers | 80 registers |
Physical floating point registers | 72 registers |
Integer functional units | 4 units |
Floating point functional units | 2 units |
Load queue | 32 entries |
Store queue | 32 entries |
Miss address file | 8 entries |
Instruction Translation Buffer | 128 entries |
Data Translation Buffer | 128 entries |
Performance is lost when the smooth flow of instructions or data is disturbed. The flow may be disturbed in several ways.
Pipeline abort condition | Penalty (in cycles) |
Branch misprediction | 7 |
JSR misprediction | 8 |
Memory order replay trap | 14 |
Other memory replay traps | 13 |
DTB miss | 13 |
ITB miss | 7 |
Integer arithmetic trap | 12 |
Floating point arithmetic trap | 13+(instruction latency) |
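To get a feel for how the penalties in the table translate into lost time, consider a rough back-of-the-envelope estimate. The function below uses the 7-cycle branch misprediction penalty from the table; the instruction counts and rates fed into it are invented, illustrative numbers, not measurements from this paper.

```c
/* Estimate the cycles lost to branch mispredicts in a run.
   Only the 7-cycle penalty comes from the table above; all
   other inputs are assumptions supplied by the caller. */
static double mispredict_cycles(double retired_insts,
                                double branch_fraction,
                                double mispredict_rate)
{
    const double penalty = 7.0;   /* cycles per mispredict (21264) */
    return retired_insts * branch_fraction * mispredict_rate * penalty;
}
```

For example, a run that retires 10^9 instructions, 15% of them conditional branches mispredicted 5% of the time, loses roughly 5.25e7 cycles, or about 0.08 seconds on a 667MHz processor.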
The Compaq Continuous Profiling Infrastructure (DCPI) and ProfileMe collect and analyze program profile data. The profiles provide information about dynamic program behavior and can be used to detect and identify performance bottlenecks. ProfileMe is part of DCPI, but it is a necessary departure from the profiling facilities originally provided by DCPI, so we will refer to ProfileMe separately since it deals with different kinds of events.
Both DCPI and ProfileMe use a combination of hardware and software to collect and analyze performance data. The data is collected by sampling internal processor registers that count or monitor performance-related instruction events. The samples are written to a database and are organized by measurement epoch (the time period over which samples were collected), host processor, process and binary program image. Samples from the database are postprocessed by DCPI/ProfileMe tools. The tools analyze, summarize and present performance data by image, procedure (function), and instruction. The tools help a developer to find and drill down to hot spots and potential performance problems in a program.
Sampling has a number of advantages. First, sampling is non-invasive and transparent. No instrumentation code needs to be compiled into or inserted into a program to collect data. The application program can be run in the normal fashion and DCPI/ProfileMe collects data in the background. Overhead is low, ranging from a 0.5% to a 3.0% slowdown. Next, information can be collected about the entire system as it executes. DCPI/ProfileMe collects data about application programs, shared libraries, device drivers and the operating system. Any component in the system is potentially a performance culprit. Finally, sampling and profiling provide performance information right down to the instruction level. This helps a programmer to identify code sequences or program behavior that are not in tune with the pipelined organization of the Alpha processor or memory system. These insights can be used to change compiler options, optimization techniques, source code and algorithms to remove bottlenecks and achieve higher performance.
DCPI and ProfileMe use two different sampling techniques, and the interpretation of some performance event data differs between them. DCPI was originally written to support in-order processors such as the Alpha 21064 and the Alpha 21164. Program counter values and hardware performance registers are periodically sampled. Over time, event counts accumulate, statistically, on instructions. The "hottest" instructions have the highest event counts, such as processor cycles, I- or D-cache misses, etc. The software tools are able to resolve certain timing problems to correctly associate event samples with instructions.
The following table shows the event types supported by both
the Alpha 21064 and 21164. The 21064 and 21164 each support additional
event types. The 21164, which is still in wide use, includes event
types for ITB/DTB misses, L3 (B-) cache behavior, replay traps, and
single-, triple- and quad-issue. Please see the dcpid(1)
man page for further details.
Event type | Description |
cycles | Processor cycles |
issues | Instruction issues |
nonissue | Non-issue cycles |
imiss | Instruction cache misses |
dmiss | Data cache misses |
branchmp | Branch mispredicts |
flow | Flow control changes |
pipelinedry | Pipeline dry cycles |
issue2 | Cycles with 2 issues |
intop | Integer operations (excluding loads/stores) |
fpop | Floating point operations (excluding loads/stores) |
load | Load instructions |
store | Store instructions |
Things changed with the Alpha 21264. Out-of-order execution makes the exact order of instruction execution indeterminate. Measurements taken by sampling the program counter skew and smear, making the assignment of event samples to instructions difficult or impossible. ProfileMe was added to address this; it uses instruction sampling instead to collect event data. At the fetch stage, an instruction is randomly selected for profiling. Information about execution events is collected as the instruction passes through the pipeline. This information is precise: it can be directly attributed to the program counter value, and it provides a record of what happened to the instruction during its execution.
Two major types of 21264 processors have been manufactured to date: the 21264 and 21264A. The Alpha 21264 (EV6) incorporates program counter sampling. The Alpha 21264A (EV67) implements ProfileMe instruction sampling. The following table summarizes the events supported and sampled in the 21264. Due to smear and skew, samples cannot be ascribed to a particular instruction with certainty.
Event type | Description |
cycles | Processor cycles |
retires | Retired instructions |
replaytrap | Memory system replay traps |
itbmiss | Retired ITB misses |
dtbmiss | Retired DTB misses |
dtbdblmiss | Retired DTB double misses |
retcondbr | Retired conditional branches |
retunalign | Retired unaligned traps |
The table below summarizes the instruction events recorded by the
Alpha 21264A. These events are recorded for each instruction that is
selected for sampling. Since the program counter value for a sampled
instruction is known, event attribution is precise. The taken
and cbrmispredict
event types are valid for conditional
branch instructions only. The ldstorder
event type is
valid for only load and store instructions.
Event type | Description |
retired | The instruction retired |
valid | The instruction retired and did not cause a trap |
taken | The conditional branch was taken |
cbrmispredict | The conditional branch was mispredicted |
nyp | Not yet prefetched |
ldstorder | The instruction caused a load-store order replay trap |
map_stall | The instruction stalled after fetch and before mapping |
early_kill | The instruction was killed before it entered an issue queue |
late_kill | The instruction was killed after it entered an issue queue |
ProfileMe also records and collects information about instruction traps. The data for each instruction includes a set of trap bits that indicate the cause of an instruction trap if one occurred during the attempted execution of the instruction. The table below lists the 21264A ProfileMe trap bits. Only one of these bits will be asserted (set) in a given ProfileMe sample.
Event type | Description |
notrap | No trap occurred |
mispredict | The instruction caused a mispredict trap |
replays | The instruction caused a replay trap |
unaligntrap | The instruction caused an unaligned load or store |
dtbmiss | The instruction caused a DTB single miss |
dtb2miss3 | The instruction caused a DTB double miss (3 level page tables) |
dtb2miss4 | The instruction caused a DTB double miss (4 level page tables) |
itbmiss | The instruction caused an instruction TLB miss |
arithtrap | The instruction caused an arithmetic trap |
fpdisabledtrap | The instruction caused a floating point disabled trap |
dfaulttrap | The instruction caused a data stream page fault |
iacvtrap | The instruction caused an instruction stream access violation |
OPCDECtrap | The instruction attempted to execute a reserved instruction or privileged instruction |
interrupt | The instruction was preempted by an interrupt |
ProfileMe support in the 21264A includes two programmable event counters. The table below is a list of ProfileMe events that can be measured using these counters. We will discuss all of these features in more detail later. Although the list of event types, trap bits and counter values is daunting, only a few data items are needed at any one time to answer a specific performance question.
Counter value name | Description |
retdelay | Instruction retire delay |
inflight | Number of cycles the instruction was inflight |
bcmisses | Number of B-cache (L2) cache misses |
replays | Number of memory system replay traps |
This section discusses the "mechanics" of analyzing a program with DCPI/ProfileMe. We will introduce a few of the most important tools, concentrating on 21264A and ProfileMe, and show how to use them. We'll also take a quick tour of the DCPI database.
The table below lists the principal tools in the DCPI/ProfileMe
suite. Be sure to consult the man pages as they contain many
important details. The man
page
dcpid(1)
contains a full list of
event types supported by 21064, 21164 and 21264, and is a handy
on-line resource. The man
page
dcpiprofileme(1)
describes the
use of DCPI to collect and analyze ProfileMe data on 21264A.
It includes a description of ProfileMe events, counters and trap bits.
Program | Description |
dcpid | DCPI daemon (starts measurement) |
dcpiflush | Flushes samples to the DCPI database |
dcpiquit | Stops the daemon (stops measurement) |
dcpiepoch | Starts a new profiling epoch |
dcpilabel | Assigns a label to the profiles for a command |
dcpiprof | Summarizes profile data by image or procedure |
dcpilist | Displays line-by-line profile data for a procedure |
dcpiwhatcg | Where have all the cycles gone in an image? |
dcpicalc | Calculate cycles-per-instruction for procedures |
dcpitopstalls | List the instructions with the most stall cycles |
dcpicat | Prints the contents of a profile file |
Performance analysis with DCPI/ProfileMe consists of five basic steps.
To illustrate these steps, we will use a short program taken from
morphological image processing. The program applies a structuring
element to each of the pixels in an image to dilate its features.
The heart of the program consists of the two functions
DilateImage
and DilatePixel
(shown below).
DilateImage
steps across the image and calls
DilatePixel
to apply the structuring element to the pixels.
DilatePixel(r, c)
int r, c ;
{
    register int i, j ;
    int sum, the_max ;

    the_max = MINUS_INFINITY ;
    for (i = 0 ; i < 31 ; i++) {
        for (j = 0 ; j < 31 ; j++) {
            sum = A[r+i][c+j] + B[i][j] ;
            if (sum > the_max) the_max = sum ;
        }
    }
    Z[r][c] = the_max ;
}

DilateImage()
{
    register int r, c ;

    for (r = 0 ; r < IMAGE_ROWS ; r++) {
        for (c = 0 ; c < IMAGE_COLS ; c++) {
            DilatePixel(r, c) ;
        }
    }
}
We will analyze the dilation program using DCPI/ProfileMe.
Five simple steps are needed to collect data:
1 > cc -O2 -arch ev6 -o dilate dilate.c
2 > setenv DCPIDB /dsk0h/dcpidb
3 > dcpid -event pm $DCPIDB
dcpid: load : loaded pcount driver
dcpid: monitoring pm
dcpid: logging to /dsk0h/dcpidb/dcpid-positron.log
4 > dilate heart.ima ball out.ima
Dilation (dilate) untimed `spec' version
Image file (heart.ima) read.
Structuring element file (ball) read.
Output image file (out.ima) written.
Leaving program (dilate)
5 > dcpiquit
dcpictl: quit successful
The cc
command at line 1 compiles the source code
for the dilation program at optimization level 2. The program is
compiled for the EV6 (Alpha 21264) target architecture. The command at line 2
creates an environment variable, DCPIDB
, which allows
us to more conveniently enter the path to the DCPI database directory.
The database directory is located at /dsk0h/dcpidb
.
The dcpid
command on line 3 starts the DCPI data collection daemon.
The dcpid
command offers the most options. The slot (event)
specifier, -event pm
, tells DCPI
to collect ProfileMe data and to write profile data into the
database directory given by the environment variable DCPIDB
.
The command on line 4 runs the application program. The command on line 5,
dcpiquit
, flushes any remaining samples to the
database and stops the DCPI daemon.
In addition to the ProfileMe event types mentioned earlier, DCPI measures the instruction's retire delay and the approximate number of cycles that the instruction was in flight. ProfileMe calls these additional measurements "counters." ProfileMe supports four counter configurations as summarized in the table below.
Configuration | Counter 0 | Counter 1 |
pm0 | Retires | Inflight cycles |
pm | Inflight cycles | Retire delay |
pm2 | Retires | B-cache (level 2 cache) misses |
pm3 | Inflight cycles | Memory system replay traps |
If you are interested in a particular measurement, you need to specify the appropriate counter configuration and collect the data. Otherwise, the analysis tools will not be able to summarize and display the desired quantity. The command:
dcpid -event pm2 $DCPIDB

for example, measures the number of retires and B-cache misses.
Let's cd
into the database directory and list the directory
contents. The DCPI daemon produces several directories and files.
10 > cd $DCPIDB
11 > pwd
/dsk0h/dcpidb
12 > ls
200104041750/                        default_scanmap
PALcode@                             hosts
PALcode.positron.65558.131411*       index/
PALcode.positron.65558.131411.log    version
dcpid-positron.log
The log file, dcpid-positron.log
, shows details about the
measurement run. It includes CPU type and speed, process creation
history, etc. dcpid
provides a useful option,
-logmaps
,
to write image information to the log file when an image is loaded.
Here is an example line showing where the "dilate" program image
was loaded:
dcpid: loadmap: loaded : pid 3443 id 3acb5b2e00206526 [0000000120000000..0000000120800000] [scan ] (dilate)
The subdirectory with the long numeric name, 200104041750
,
represents a single measurement epoch (the period of time during which
measurements were taken). An epoch is created for each successful
measurement run. A new epoch can also be started without terminating
and restarting dcpid
by executing the dcpiepoch
command. The subdirectory name is the starting time (year, month, day,
hour, minute) of the epoch.
An epoch directory contains data for each of the host CPUs (e.g.,
positron
), one subdirectory per host. Within a host
subdirectory are profile data files with long cryptic names such as
"dilate_20010404173438206526." The contents of a profile data file can
be listed using dcpicat
:
dcpicat dilate_20010404173438206526
dcpicat
displays the profile header information (shown below)
followed by a list of address/event counts.
Profile data is kept in a compact form. Address/count
pairs are stored only for those pairs with a non-zero count, so
addresses will probably not be contiguous.
name dilate
image 20010404173438206526
path dilate
epoch 200104041750
platform positron
text_start 0x000120000000
text_size 0x000000004000

Examining profile data is not for the faint of heart and most users will not really care about the contents of profile files.
dcpicat
output, however, can be postprocessed and some users may want to write
their own tools to do so.
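For example, assuming the address/count pairs appear one per line after the header, a short pipeline like the following (a sketch; the exact field layout of dcpicat output may differ) would list the hottest addresses:

```shell
# List the ten addresses with the highest sample counts.
# Assumes lines of the form "0x<address> <count>" after the header.
dcpicat dilate_20010404173438206526 \
  | awk 'NF == 2 && $1 ~ /^0x/ { print $2, $1 }' \
  | sort -rn \
  | head -10
```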
The dcpiprof
program breaks out event profile data by
binary program image and by procedure within an image. Include the
-i
option on the command line to break out results by
image. The command:
dcpiprof -pm retired -i

produces an image-by-image summary of retired instructions as shown below.
Column               Total  Period (for events)
------               -----  ------
retired:count       353866   63488

The numbers given below are the number of samples for each listed event
type or, for the ratio of two event types, the ratio of the number of
samples for the two event types.
===========================================================
retired
 :count      %    cum%  image
 313353  88.55%  88.55%  /vmunix
  40119  11.34%  99.89%  dilate
    106   0.03%  99.92%  /usr/shlib/libXt.so
    105   0.03%  99.95%  /dsk0h/dcpidb/PALcode
     75   0.02%  99.97%  /usr/bin/dcpid
     28   0.01%  99.98%  /usr/shlib/libc.so
     24   0.01%  99.98%  /usr/shlib/X11/libdix.so
     14   0.00%  99.99%  /usr/shlib/libX11.so
     11   0.00%  99.99%  /usr/shlib/libXm.so
      8   0.00%  99.99%  /usr/shlib/libmach.so
      6   0.00% 100.00%  /usr/dt/bin/dtterm
      5   0.00% 100.00%  /usr/shlib/X11/libxkb.so
      4   0.00% 100.00%  unknown@positron
      2   0.00% 100.00%  /usr/shlib/X11/libos.so
      2   0.00% 100.00%  /usr/shlib/libpthread.so
      1   0.00% 100.00%  /usr/shlib/X11/libmi.so
      1   0.00% 100.00%  /bin/sh
      1   0.00% 100.00%  /usr/shlib/X11/libcfb.so
      1   0.00% 100.00%  /usr/bin/csh

This summary shows the number of instruction retire events for each active program or library image during the measurement epoch. The dilation program,
dilate
, had 40119 instruction
retire events. Note that the summary is system-wide, including
event counts for the Tru64 kernel (/vmunix
), the
operating system PALcode (/dsk0h/dcpidb/PALcode
),
the DCPI daemon (/usr/bin/dcpid), various runtime
libraries (*.so
), and shell programs (sh
and csh
).
Hint: Although the kernel image, /vmunix
, has many
retire events, these are due to "idle time" processing. To reduce
idle time, put the commands to start DCPI, run the application
and stop DCPI in a single shell command line. We'll use this
technique in later examples.
dcpiprof
can produce a summary for any of the
ProfileMe event types as specified by the argument to the -pm
option in the dcpiprof
command line. The commands:
dcpiprof -pm replay -i
dcpiprof -pm mispredict -i
dcpiprof -pm unaligntrap -i

will produce image-by-image summaries of replay traps, branch mispredicts and unaligned memory access traps, respectively. This is a quick way to find images with performance problems like alignment fix-ups.
dcpiprof
will also produce image by image summaries for
ProfileMe counter values. When we collected data for this example,
we used the default counter configuration (-pm
), which
measures the number of cycles an instruction is inflight and instruction
retire delay. The commands:
dcpiprof -pm retired:inflight -i
dcpiprof -pm retired:retdelay -i

produce summaries for the number of inflight cycles for retired instructions and the retire delay for retired instructions, respectively. Note that the counter name must be combined with a ProfileMe event to select instructions that satisfy some event criteria (e.g., retired, suffered a replay trap, etc.).
This syntax is sufficient to display results for individual event types and trap bits. ProfileMe supports a more flexible and powerful way to select event data of interest. Recall that each instruction which is sampled using ProfileMe has a set of event bits, trap bits, counter values and PC address associated with it. The ProfileMe software aggregates and organizes sample data using event bits, trap bits and program counter values.
The argument to -pm
in the dcpiprof
command
is really an expression that selects a
set of instructions/samples which satisfy a particular combination
of event flags and trap bits. Two selection operators are
supported:
"^" — logical AND of event/trap bits.
"!" — logical NOT of an event/trap bit.
dcpiprof
will find, summarize and display data for all instructions that
satisfy the selection expression.
For example,
dcpiprof -pm retired^notrap -i

selects sample data for instructions that retired without trapping. The exclamation mark "
!
" takes the logical NOT of an
event flag or trap bit as shown in this example:
dcpiprof -pm taken^\!mispredict -i

This command selects instructions where a branch was taken without a mispredict. The exclamation mark must be quoted on the command line to avoid misinterpretation by the shell. Backslash, "\", may be used in place of "!".
The selection expression may also include the "+
"
operator. The plus operator separates event/trap events and expressions.
dcpiprof
will select samples for each AND/NOT expression (or
event type) and display it in a separate column.
For example, the command:

dcpiprof -pm retired+replay -i

displays a table with two columns:

 retired                  replays
  :count      %     cum%   :count       %   image
  313353  88.55%  88.55%       12  70.59%   /vmunix
   40119  11.34%  99.89%        0   0.00%   dilate
     106   0.03%  99.92%        0   0.00%   /usr/shlib/libXt.so
     105   0.03%  99.95%        1   5.88%   /dsk0h/dcpidb/PALcode
      75   0.02%  99.97%        1   5.88%   /usr/bin/dcpid
      28   0.01%  99.98%        0   0.00%   /usr/shlib/libc.so
      24   0.01%  99.98%        1   5.88%   /usr/shlib/X11/libdix.so
      14   0.00%  99.99%        0   0.00%   /usr/shlib/libX11.so
      11   0.00%  99.99%        1   5.88%   /usr/shlib/libXm.so
       8   0.00%  99.99%        0   0.00%   /usr/shlib/libmach.so
       6   0.00% 100.00%        1   5.88%   /usr/dt/bin/dtterm
       5   0.00% 100.00%        0   0.00%   /usr/shlib/X11/libxkb.so
       4   0.00% 100.00%        0   0.00%   unknown@positron
       2   0.00% 100.00%        0   0.00%   /usr/shlib/X11/libos.so
       2   0.00% 100.00%        0   0.00%   /usr/shlib/libpthread.so
       1   0.00% 100.00%        0   0.00%   /usr/shlib/X11/libmi.so
       1   0.00% 100.00%        0   0.00%   /bin/sh
       1   0.00% 100.00%        0   0.00%   /usr/shlib/X11/libcfb.so
       1   0.00% 100.00%        0   0.00%   /usr/bin/csh

The next example command:

dcpiprof -pm retired+taken^mispredict -i

displays the following two column table:

                              taken^
 retired                  mispredict
  :count      %     cum%      :count       %   image
  313353  88.55%  88.55%           6  54.55%   /vmunix
   40119  11.34%  99.89%           0   0.00%   dilate
     106   0.03%  99.92%           0   0.00%   /usr/shlib/libXt.so
     105   0.03%  99.95%           1   9.09%   /dsk0h/dcpidb/PALcode
      75   0.02%  99.97%           1   9.09%   /usr/bin/dcpid
      28   0.01%  99.98%           0   0.00%   /usr/shlib/libc.so
      24   0.01%  99.98%           1   9.09%   /usr/shlib/X11/libdix.so
      14   0.00%  99.99%           0   0.00%   /usr/shlib/libX11.so
      11   0.00%  99.99%           1   9.09%   /usr/shlib/libXm.so
       8   0.00%  99.99%           0   0.00%   /usr/shlib/libmach.so
       6   0.00% 100.00%           0   0.00%   /usr/dt/bin/dtterm
       5   0.00% 100.00%           0   0.00%   /usr/shlib/X11/libxkb.so
       4   0.00% 100.00%           0   0.00%   unknown@positron
       2   0.00% 100.00%           0   0.00%   /usr/shlib/X11/libos.so
       2   0.00% 100.00%           0   0.00%   /usr/shlib/libpthread.so
       1   0.00% 100.00%           1   9.09%   /usr/shlib/X11/libmi.so
       1   0.00% 100.00%           0   0.00%   /bin/sh
       1   0.00% 100.00%           0   0.00%   /usr/shlib/X11/libcfb.so
       1   0.00% 100.00%           0   0.00%   /usr/bin/csh

Use of the "+" sign is a bit unfortunate since it does
not perform a logical OR between event/trap bits. It does let
us create multi-column tables, however.
The ability to select arbitrary sets of
instruction samples is quite general and powerful.
Not all combinations of instruction event flags, trap bits and counter
values are even meaningful or useful, however. Just because you
can ask a question with dcpiprof
doesn't
mean that you should!
Based on the number of retired instructions, the program image
dilate
has the most relevant activity. Let's break down
this image by procedure using dcpiprof
:
dcpiprof -pm retire dilate

The only difference in the command line is the addition of the
image name in place of the -i option. This command generates
the table:

Column          Total   Period (for events)
------          -----   ------
retired:count   40119   63488

The numbers given below are the number of samples for each listed
event type or, for the ratio of two event types, the ratio of the
number of samples for the two event types.
===========================================================
 retired
  :count      %     cum%  procedure         image
   39848  99.32%  99.32%  DilatePixel       dilate
     116   0.29%  99.61%  WriteOutputImage  dilate
     111   0.28%  99.89%  ReadInputImage    dilate
      35   0.09%  99.98%  DilateImage       dilate
       9   0.02% 100.00%  Initialize        dilate

We discussed event selection in the previous section (4.3). All of those features can be used to produce procedure by procedure summaries of various event types, trap bits, etc.
The procedure DilatePixel
has the most retired
instructions as expected since this function contains the
innermost loops of the dilation program.
dcpilist
is used to produce an instruction level
analysis of a procedure. The command:
dcpilist -pm retire DilatePixel dilate

produces an annotated disassembly of DilatePixel. The
-pm
option works in the same way as dcpiprof
.
Each line of the disassembly is annotated with the selected event
information.
As you can see below, the output from dcpilist
can
be quite lengthy. It's best to redirect the output from dcpilist
to a file and then examine the file with a text editor. The part
of the function with the most traffic (retired instructions) starts at
address 0x1200022f0 and ends at 0x120002370. This code corresponds to the
innermost loop of DilatePixel
.
DilatePixel - dcpilist analysis

DilatePixel:
 retired
  :count   freq
       0      3  0x120002290   ldah    gp, 8192(t12)
       0      3  0x120002294   ldq_u   zero, 0(sp)
       0      3  0x120002298   lda     gp, 24112(gp)
       0      3  0x12000229c   ldq_u   zero, 0(sp)
       3      3  0x1200022a0   addl    zero, a0, a0
       3      3  0x1200022a4   addl    zero, a1, a1
       2      3  0x1200022a8   addq    a0, a0, v0
       3      3  0x1200022ac   ldq     t1, -32592(gp)
       3      3  0x1200022b0   ldq     t2, -32568(gp)
       1      3  0x1200022b4   lda     t0, -32767(zero)
       8      3  0x1200022b8   s8addq  v0, a0, v0
       3      3  0x1200022bc   addq    a1, a1, a1
       0      3  0x1200022c0   addq    v0, v0, v0
       0      3  0x1200022c4   s8subq  v0, a0, v0
       2      3  0x1200022c8   lda     t1, 0(t1)
       7      3  0x1200022cc   bis     zero, zero, a0
       6      3  0x1200022d0   lda     t2, 0(t2)
       7      3  0x1200022d4   s4addq  v0, t1, t1
       5      3  0x1200022d8   ldq_u   zero, 0(sp)
       2      3  0x1200022dc   ldq_u   zero, 0(sp)
     124    124  0x1200022e0 : addq    t1, a1, t3
     124    124  0x1200022e4   bis     zero, zero, t4
     107    124  0x1200022e8   bis     zero, t2, t5
     136    124  0x1200022ec   ldq_u   zero, 0(sp)
     888    897  0x1200022f0 : ldwu    t7, 0(t3)
     911    897  0x1200022f4   ldwu    a3, 0(t5)
     932    897  0x1200022f8   addl    t4, 0x4, t4
     821    897  0x1200022fc   lda     t3, 8(t3)
     909    897  0x120002300   ldwu    a4, -6(t3)
     936    897  0x120002304   ldwu    a5, 2(t5)
     891    897  0x120002308   lda     t5, 8(t5)
     922    897  0x12000230c   ldwu    t8, -4(t3)
     905    897  0x120002310   ldwu    t9, -4(t5)
     953    897  0x120002314   sextw   t7, t7
     904    897  0x120002318   sextw   a3, a3
     863    897  0x12000231c   sextw   a4, a4
     916    897  0x120002320   sextw   a5, a5
     857    897  0x120002324   addq    t7, a3, t7
     912    897  0x120002328   ldwu    a3, -2(t3)
     903    897  0x12000232c   addq    a4, a5, a4
     897    897  0x120002330   cmplt   t0, t7, a5
     882    897  0x120002334   sextw   t8, t8
     876    897  0x120002338   sextw   t9, t9
    1883    897  0x12000233c   cmovne  a5, t7, t0
     861    897  0x120002340   addq    t8, t9, t8
     887    897  0x120002344   ldwu    t9, -2(t5)
     890    897  0x120002348   sextw   a3, a3
     861    897  0x12000234c   cmplt   t0, a4, t10
    1786    897  0x120002350   cmovne  t10, a4, t0
     894    897  0x120002354   sextw   t9, t9
     900    897  0x120002358   cmplt   t4, 0x1c, a4
     880    897  0x12000235c   addq    a3, t9, a3
     896    897  0x120002360   cmplt   t0, t8, t7
    1791    897  0x120002364   cmovne  t7, t8, t0
     922    897  0x120002368   cmplt   t0, a3, a5
    1803    897  0x12000236c   cmovne  a5, a3, t0
     883    897  0x120002370   bne     a4, 0x1200022f0
     150    124  0x120002374   ldq_u   zero, 0(sp)
     114    124  0x120002378   ldq_u   zero, 0(sp)
     109    124  0x12000237c   ldq_u   zero, 0(sp)
     413    388  0x120002380 : ldwu    t7, 0(t3)
     395    388  0x120002384   ldwu    t9, 0(t5)
     371    388  0x120002388   addl    t4, 0x1, t4
     377    388  0x12000238c   lda     t3, 2(t3)
     398    388  0x120002390   cmplt   t4, 0x1f, a2
     384    388  0x120002394   lda     t5, 2(t5)
     436    388  0x120002398   sextw   t7, t7
     389    388  0x12000239c   sextw   t9, t9
     402    388  0x1200023a0   addq    t7, t9, t7
     377    388  0x1200023a4   cmplt   t0, t7, t6
     715    388  0x1200023a8   cmovne  t6, t7, t0
     395    388  0x1200023ac   bne     a2, 0x120002380
     117    124  0x1200023b0   addl    a0, 0x1, a0
     126    124  0x1200023b4   lda     t1, 1084(t1)
     138    124  0x1200023b8   cmplt   a0, 0x1f, a5
     120    124  0x1200023bc   lda     t2, 62(t2)
     134    124  0x1200023c0   bne     a5, 0x1200022e0
       2      3  0x1200023c4   ldq     a4, -32576(gp)
       8      3  0x1200023c8   lda     a4, 0(a4)
       4      3  0x1200023cc   s4addq  v0, a4, v0
       3      3  0x1200023d0   addq    v0, a1, v0
       6      3  0x1200023d4   stw     t0, 0(v0)
       4      3  0x1200023d8   ret     zero, (ra), 1
       0      0  0x1200023dc   ldq_u   zero, 0(sp)
Interpretation of DCPI/ProfileMe results is the most difficult part of the performance analysis. Superscalar processors in general rely upon effective use of the processing pipeline and the memory hierarchy. Analysis relies upon an understanding of the memory system and pipeline behavior. Behavior of a superscalar pipeline is complex, especially in an out-of-order execution machine like the Alpha 21264, and one of many possible factors may cause a bottleneck. Section 2 of this paper is a brief introduction to Alpha 21264 pipeline behavior. Another resource for information on superscalar computing and Alpha programming is the paper Performance tips for Alpha Linux C programmers. Although a few of the tips in that paper are Linux-specific, many of the techniques apply to Alpha programs generically. The bibliography contains a number of references on the Alpha 21264 design which may also be helpful.
This section is a collection of possible performance problems. We'll describe DCPI/ProfileMe measurements that may indicate the presence of a particular problem. Determination of the exact cause generally involves a close examination of the algorithm, source code and annotated disassembly. It's often important to understand the dynamic behavior of the program especially the way in which the program accesses data in memory. On the Alpha 21264, the immediate instruction execution history (up to 80 instructions) is sometimes relevant since later instructions can be affected by the scheduling/execution of earlier instructions in the pipeline.
A broad indicator of performance is the ratio of retired instructions to raw cycles. This ratio is a rough measure of the degree of instruction level parallelism achieved during program execution. Higher values indicate that more work is being carried out in parallel.
The ratio is easily computed by measuring both retires and cycles, then dividing retires by cycles:
61 > dcpid -slot cycles -slot pm $DCPIDB ; dilate heart.ima ball out.ima ; dcpiquit
dcpid: load : loaded pcount driver
dcpid: monitoring cycles
dcpid: monitoring pm
dcpid: logging to /dsk0h/dcpidb/dcpid-positron.log
Dilation (dilate) untimed `spec' version
Image file (heart.ima) read.
Structuring element file (ball) read.
Output image file (out.ima) written.
Leaving program (dilate)
dcpictl: quit successful
62 > dcpiprof -i -pm retire -event cycles
Column         Total   Period (for events)
------         -----   ------
cycles          9212   126976
retired:count  20439   126976

The numbers given below are the number of samples for each listed
event type or, for the ratio of two event types, the ratio of the
number of samples for the two event types.
===========================================================
                          retired
  cycles      %     cum%   :count       %   image
    8361  90.76%  90.76%    19536  95.58%   dilate
     724   7.86%  98.62%      816   3.99%   /vmunix
      50   0.54%  99.16%       39   0.19%   /usr/bin/dcpid

The ratio of retired instructions to cycles in this example is (19536/8361) = 2.336. The
dcpid
command above
"multiplexes" or shares the DCPI/ProfileMe hardware counters
between the measurement of CPU cycles and ProfileMe events.
(Yep, you can do that.) The example also demonstrates how
measured "idle" time in the kernel can be minimized by combining
commands into a single line, reducing the amount of "dead" time
due to typing.
The Alpha 21264 is able to retire up to eleven instructions in a single cycle. It can sustain a rate of eight retires per cycle for short periods of time. This peak figure is further limited in real code by the availability (or shortage) of physical integer and floating point registers. In integer code with D-cache references, a maximum of 3.3 retires per cycle could be achieved, but is quite unlikely in practice. In floating point code with D-cache references, the maximum is 2.3 floating point instructions per cycle, again, a maximum that is unlikely to be achieved in real code. Thus, a retire/cycle ratio of 3 in integer code is exceptionally good. If the ratio is in the neighborhood of one retire per cycle, then further improvement is possible and investigation is warranted.
As described in Section 2, the Alpha 21264 fetches instructions aggressively. It uses branch prediction to predict the direction of conditional branches and then fetches instructions based upon the predicted direction. The effectiveness of speculative instruction fetch and execution depends upon the ability of the branch predictor to make predictions correctly.
The ProfileMe cbrmispredict
event bit is set when
a (sampled) conditional branch instruction is mispredicted. For
a given conditional branch instruction, the ratio of its
branch mispredictions to its retires is its misprediction rate.
The misprediction rate is more helpful than the raw number of
mispredictions since it takes into account the number of times
that the instruction was fully executed to retirement.
The following example is taken from measurements made on the dilation
program. The dcpilist
command illustrates another feature
supported in the -pm
, the ability to compute a ratio.
The "::
" operator tells dcpilist
to compute
the ratio of cbrmispredict
and the number of retires.
37 > dcpilist -pm cbrmispredict::retired DilatePixel dilate
DilatePixel:
 cbrmispredict
        :count
            ::
       retired
        :count   freq
            NA      3  0x120002290   ldah   gp, 8192(t12)
            NA      3  0x120002294   ldq_u  zero, 0(sp)
 ...
        0.0000    897  0x120002370   bne    a4, 0x1200022f0
 ...
        0.0000    388  0x1200023ac   bne    a2, 0x120002380
 ...
        0.0224    124  0x1200023c0   bne    a5, 0x1200022e0
 ...

There are three conditional branch instructions in
DilatePixel
.
The number of sampled mispredicts for these instructions
are 0, 0 and 3, respectively. The branch at address 0x120002370
is at the bottom of the innermost loop. No mispredicts were measured
indicating that the misprediction rate is quite low, which is good,
of course.
No branch mispredictions were detected because none of the sampled
executions of the conditional branch at 0x120002370
mispredicted. This is not the same as saying "none of the executed branches
was mispredicted." Since ProfileMe is based on a sampling technique,
it's possible that the execution time of the program was too short to
detect even a single misprediction. In general, program runtime should be
long enough to collect a statistically significant set of samples (one
minute or longer.) Alternatively, a short program could be run multiple
times within the same DCPI epoch to increase the amount of data collected.
The taken
ProfileMe event indicates that the sampled
branch was taken. This event can be used to determine the most
frequent direction (taken or not taken) for a branch. If the ratio of
taken events to retires for a particular branch instruction is greater
than 50%, then the taken direction is the most frequent.
The mispredict
trap bit is set for any conditional
branch, subroutine jump (JSR), subroutine return (RET), unconditional
jump (JMP) and jump to coroutine (JMP_COROUTINE) that suffered a
misprediction trap. This bit indicates the failure of both I-cache
line and way (direction) prediction. (The subject of line prediction
is beyond the scope of this paper. See the IEEE Micro paper by Kessler.)
Line and way predictors are correct 85% to 100% of the time for real
world applications. Misprediction rates higher than 15% may indicate
a problem.
If branch mispredictions are a problem, how can they be eliminated or reduced in number? Generally, the 21264 favors straight-line code -- long sequences of instructions that are uninterrupted by branches. Briefly, some suggestions are:
Here are a few fine points about branch prediction. The 21264 uses a tournament branch prediction scheme: it adaptively chooses between a global predictor and a local predictor when predicting the direction of a conditional branch. The choice, global and local predictors are table driven, with 4,096, 4,096 and 1,024 entries, respectively. The choice and global prediction tables are indexed by the most recent global path history (the taken/not taken state of the twelve most recent branches) and the local prediction table is indexed by bits 11:2 of the virtual program counter value of the current branch.
The 21264 fetches four instructions at a time. These four instructions are aligned on a 16-byte boundary. This aligned group of four instructions is called a "quadpack." The branch and I-cache line predictors perform best when there is only one conditional branch per quadpack. If branch density cannot be decreased through the techniques mentioned above, critical branches can be spaced out among quadpacks by inserting "no-op" instructions. There are three no-op instructions:
Mnemonic   Disassembly         Purpose
--------   -----------         -------
UNOP       LDQ_U R31,0(Rx)     Universal noop
NOP        BIS R31,R31,R31     Integer noop
FNOP       CPYS F31,F31,F31    Floating noop
These instructions should not encounter operand issue delay, destination issue delay and functional unit delay.
Since the 21264 fetches an entire quadpack at a time, fetch efficiency is maximized when all four instructions execute. If, for example, control enters a quadpack in its middle or at the bottom, then not all instructions in the quadpack will execute and the processor is effectively fetching fewer than four instructions per clock cycle. Fetch efficiency is improved by:
Use the compiler's target architecture option (-arch ev6)
to enable this optimization.
The 21264 detects and handles hazards that may produce incorrect results when instructions are issued and executed out of program order. Some of these hazards are removed before instruction issue as in the case of certain register dependencies. Some hazards, however, cannot be detected until an instruction has issued and has begun execution. Memory system hazards, for example, cannot be detected until load and store instructions have issued and their effective addresses have been evaluated. Such dynamic hazards are eliminated by replaying the instruction stream from an architecturally correct and consistent state. The mechanism for replaying the instruction stream is called the "replay trap."
Section 2.6 of Performance Tips for Alpha Linux C programmers discusses memory system replay traps in some detail. There are five main kinds of 21264 memory system replay traps:
Two broad ProfileMe metrics can be used to identify program components that are taking an unacceptable number of replay traps. The ratio of trapped instructions to retired instructions:
!notrap:count::retired:count

is an indicator of trap frequency (X traps per Y retires.) The ratio of non-retired (aborted) instructions to retired instructions:
!retired:count::retired:count

indicates the ratio of the amount of work discarded to the amount of work completed (X aborts per Y retires.) These ratios can be computed and displayed by
dcpiprof
to produce an image-by-image
summary. Ideally, both ratios will be small for each image. A trap to
retire ratio of 0.02 or greater and/or an abort to retire ratio of
0.25 or greater may indicate the presence of a replay-related
performance problem. Once the problem has been narrowed to a
particular image, dcpiprof
can display the number of
replay events per procedure to identify particular procedures for
detailed, instruction-level analysis.
Here is a simple program, siz_replays.c
,
that is designed to cause a large number of wrong size
replay traps.
#define LOOP_COUNT (16)
#define ARRAY_SIZE (16*1024*1024)

long int mem_loc[ARRAY_SIZE] ;

long int write_part_read_all(int *pA, long int *pB)
{
  *pA = 35 ;
  return( *pB ) ;
}

main()
{
  long int i, j, z ;

  for (i = 0 ; i < LOOP_COUNT ; i++) {
    for (j = 0 ; j < ARRAY_SIZE ; j++) {
      // Write the first four bytes at mem_loc[j]
      // then read all eight bytes causing a wrong
      // size replay trap
      z = write_part_read_all((int *)&mem_loc[j], &mem_loc[j]) ;
    }
  }
}

The function
write_part_read_all
takes two arguments:
a pointer to a 32-bit integer and a pointer to a 64-bit long integer.
The assignment statement writes 35 to the 4 bytes in memory as
referenced by the first pointer. The function then reads and returns
the 8 byte value at the location referenced by the second pointer. The
function main
repeatedly calls write_part_read_all
and passes in the same actual pointer value for both the first
and second pointer parameter. The pointer arguments are aliases
for the same base memory address although the pointers refer to data
items of different size.
Let's examine the assembler language code for write_part_read_all
.
We obtained the assembler language code by compiling the program with
the Compaq C compiler using the -Wf,-S
option:
ccc -Wf,-S -o siz_replays siz_replays.c

This produces the file
siz_replays.s
. The assembler code for
the body of write_part_read_all
is:
 # 7    {
 # 8      *pA = 35 ;
        mov     35, $1                # 000008
        stl     $1, ($16)
 # 9      return( *pB ) ;
        ldq     $0, ($17)             # 000009
 # 10   }
        ret     ($26)                 # 000010

When
write_part_read_all
is called, R16 and R17 will
contain the first and second pointer arguments to the function. The
code moves 35 into register R1 ($1
) and stores the value
of R1 into the 32-bit longword at the address specified by R16
($16
.) The store is immediately followed by a load from
the 64-bit quadword at the address specified by R17 ($17
.)
Because main
passes in a pointer to the same memory
location for the first and second arguments, the store
(stl
) and load (ldq
) will refer to the same
memory address. However, the store instruction will write to the first
four bytes (A B C D) shown in the diagram below, while the
quadword load reads back all eight bytes (A through H):

A | B | C | D | E | F | G | H
The next step is to run the program and collect DCPI/ProfileMe data. Here are the shell commands that we used to collect DCPI/ProfileMe data:
setenv DCPIDB /dsk0h/dcpidb
dcpid -event pm $DCPIDB ; siz_replays ; dcpiquit

The execution time for
siz_replays
is a bit short
(about 23.6 seconds), but that should be long enough to capture
some replays.
Hint: Use the Tru64 UNIX time command to measure program execution time. If a program has a short execution time, run the program several times during a DCPI measurement epoch (period.) See the man page for dcpiepoch to get even more ideas about using measurement epochs.
The following example demonstrates how to produce an image-by-image
summary of the abort to retire ratio and the ratio of discarded (non-retired)
instructions to retired instructions.
The command line illustrates a few short-cuts. Notice that it was not
necessary to specify :count following the ProfileMe event
names as :count is assumed. Backslash was used to denote
is assumed. Backslash was used to denote
logical NOT instead of the exclamation mark, which must be quoted to
avoid misinterpretation by the shell. The event name retired
could have been abbreviated as ret
, further shortening
the command line.
dcpiprof -i -pm \notrap::retired+\retired::retired
...
!notrap    !retired
 :count     :count
     ::         ::
retired    retired
 :count     :count   image
 0.2503     5.1905   siz_replays
 0.1250     1.5000   /usr/shlib/libpthread.so
 0.1000     0.0000   /usr/shlib/libmach.so
 0.0714     0.0714   /usr/shlib/libXt.so
 0.0484     0.3385   /dsk0h/dcpidb/PALcode
 0.0296     0.7757   /vmunix
 0.0132     0.5658   /usr/bin/dcpid

The ratio of trapped to retired instructions for
siz_replays
is quite large; roughly one instruction suffered a replay trap for every four
instructions that successfully retired. The ratio of discarded
to productive instructions (5.1905) is also very high. Five instructions
were discarded for every retired instruction -- highly inefficient.
Ratios are shown for other components (runtime libraries, the DCPI
daemon, and the kernel), but realistically as an ordinary programmer,
we can only tune the application siz_replays
.
The next step is to identify the procedures with the most replay traps
within a particular program image. Here is the command that we used to
identify write_part_read_all
as the culprit.
dcpiprof -pm replays siz_replays
...
 replays
  :count       %     cum%  procedure            image
   11669 100.00%  100.00%  write_part_read_all  siz_replays

We knew this to be so by design, but isolation to a particular procedure is an essential step in replay trap analysis.
The final step is to use dcpilist
to disassemble and
display the culprit procedure.
dcpilist -pm retired+replays write_part_read_all siz_replays

write_part_read_all:
 retired  replays
  :count   :count   freq
    4257        0   4208  0x120001190   bis  zero, 0x23, t0
    4135        0   4208  0x120001194   stl  t0, 0(a0)
    4223    11669   4208  0x120001198   ldq  v0, 0(a1)
    4220        0   4208  0x12000119c   ret  zero, (ra), 1

Here, the wrong size replay traps have accumulated on the load instruction. The register names
a0
and a1
are
synonyms for R16 ($16
) and R17 ($17
.)
Before leaving this example, let's rewrite the program to read
a full 64-bit quadword instead of a 32-bit longword. This should
eliminate the wrong size replay traps. The changes
produce the following C code (occ_replays.c
.)
#define LOOP_COUNT (16)
#define ARRAY_SIZE (16*1024*1024)

long int mem_loc[ARRAY_SIZE] ;

long int write_read(long int *pA, long int *pB)
{
  *pA = 35 ;
  return( *pB ) ;
}

main()
{
  long int i, j, z ;

  for (i = 0 ; i < LOOP_COUNT ; i++) {
    for (j = 0 ; j < ARRAY_SIZE ; j++) {
      // Read from a memory location (mem_loc[j]) after
      // writing to it causing occasional load-store replay
      // traps. Alpha 21264 "learns" not to execute the
      // load ahead of the store, so a replay will occur every
      // 64k cycles when the "learn" bit is reset.
      z = write_read(&mem_loc[j], &mem_loc[j]) ;
    }
  }
}

We renamed the first function to
write_read
and changed
the type of the first parameter such that both formal parameters are
now pointers to long integer. When main
calls
write_read
, it passes in the same address as the
value of the first and second actual parameter. The store followed
by a load from the same address should trigger a load-store replay
trap.
The modified program, occ_replays
, executes in
about 4.5 seconds, considerably faster than siz_replays
.
ProfileMe records load-store order traps, specifically,
via ldstorder
events. Running dcpilist
produces the following transcript.
dcpilist -pm ldstorder write_read occ_replays

write_read:
 ldstorder
    :count
         0  0x120001180   bis  zero, 0x23, t0
         1  0x120001184   stq  t0, 0(a0)
         0  0x120001188   ldq  v0, 0(a1)
         0  0x12000118c   ret  zero, (ra), 1

One
ldstorder
event was recorded for the store
instruction. Remember that ProfileMe is a sampling process that
chooses instructions at random for profiling, so certainly more
than one load-store order replay trap occurred during execution.
Only one such event was recorded because the frequency of occurrence was
relatively small with respect to the frequency of wrong size
replay traps in siz_replays
.
Why did occ_replays
run faster and why was the
frequency of load-store order replay traps relatively small?
The designers of the Alpha 21264 felt that load-store order
traps would occur frequently enough in real programs to warrant
preventive action. The first time that the 21264 detects the store-load
hazard, it sets a bit in a load wait table. On subsequent
executions, the load instruction is issued after the store
instruction thus avoiding a costly replay trap. The wait table is
periodically cleared (every 32K cycles) to avoid unnecessary waits.
In the example above, the 21264 has "learned" to issue the
store before the load. Load-store order replay traps still occur
whenever the wait table is cleared, but the frequency of occurrence
is greatly diminished.
But, wait. There's more. The dcpiprof
summary shows
that 694 replay events were tallied against the program
occ_replays
.
dcpiprof -i -pm replays
...
 replays
  :count       %     cum%  image
     694  94.04%   94.04%  occ_replays
      43   5.83%   99.86%  /vmunix
       1   0.14%  100.00%  /usr/shlib/libc.so

Where are the other replays? The following
dcpilist
transcript shows the distribution of replay
events
in the function write_read
.
dcpilist -pm replays write_read occ_replays

write_read:
 replays
  :count
       0  0x120001180   bis  zero, 0x23, t0
     645  0x120001184   stq  t0, 0(a0)
      49  0x120001188   ldq  v0, 0(a1)
       0  0x12000118c   ret  zero, (ra), 1

One of the
replay
events on the store instruction is
the ldstorder
event. The other replay traps are due to
the short, memory-intensive nature of the loop. The Alpha 21264
has a load queue and a store queue to hold pending load and store
instructions, respectively. The queues let the processor core run
ahead of the relatively slow memory system. When a load (store)
operation arrives at a full queue, the load (store) instruction is
replayed. The instruction proceeds when an empty queue entry becomes
available. The load and store instructions in write_read
are within a very short loop with little CPU work to be performed
while the memory system is completing read and write operations.
The load and store queues become full and the instructions are replayed.
The program siz_replays also takes some load-store replay traps, which can be exposed using dcpiprof and dcpilist. Short, memory-intensive loops in particular involve complex interactions between CPU and memory units, making analysis difficult.
Clearly, the difficult part of the analysis is determining why the replay traps occurred. The chief and common condition in load-store, load-load, load-miss and wrong size replay traps is that two or more memory instructions are attempting to access the same memory location within a shared issue-execution neighborhood. A load or store instruction taking replay traps must be interfering with the execution of some other nearby load or store instruction (or vice versa.) You should first relate the instructions back to the source code from which they came. Then, think about access to memory data structures within the local neighborhood of the culprit instruction in the source code.
The elimination of memory system replay traps may require a rethinking of the algorithm, the placement of data structures in memory, the size/padding of data structures in memory, indexing and access patterns. Alpha programmers, in general, must give careful consideration to the memory-related behavior of their programs. Here are some suggestions for avoiding replay traps.
Recompiling with the 21264 target architecture option (-arch
ev6) may help if the program was compiled for an earlier Alpha
implementation.
The Compaq C compiler tries to avoid replay traps when compiling code for the Alpha 21264. The compiler will analyze pointers and the placement of data structures in memory in order to make certain optimizations. In general, a compiler has only a limited ability to determine if two data structures occupy the same area in memory or if two pointers refer to the same data item. If the compiler cannot make this determination with certainty, it must generate conservative code which is functionally correct under all circumstances. Conservative code will likely have more explicit memory access operations and consequently more opportunity for memory system replay traps. Further, the compiler cannot adjust the scheduling of instructions (that is, how instructions are placed in program order) to increase the spatial and temporal distance between memory accesses to the same data items. Modern compilers can, in general, do a better job of code generation and optimization when the programmer is explicit about data structures and the program's access into them.
Memory hierarchy provides a cost efficient trade-off between memory size and memory speed. It also compensates for the speed imbalance between the high clock and processing rate of the CPU and the much slower speed of bulk memory devices like large capacity RAM and disk drives. Effective use of memory hierarchy depends upon the exploitation of spatial and temporal locality -- having data where it is expected when it is expected. This usually means having the next data item to be used close, in space and time, to the most recently used data item.
There are four key levels in the memory hierarchy of a typical uniprocessor system based on the Alpha 21264.
The L1 I-cache and D-cache are implemented on the same chip as the processor core. They provide the fastest access to instructions and data, but are relatively small (64Kbytes each.) The processor looks in these caches first for instructions and data. If needed data or instructions are not found there, the processor looks to the L2 cache. The L2 cache is a unified cache meaning that it contains both instructions and data, unlike L1 where separate caches are dedicated to instructions and data. The L2 cache is off-chip and is a bit larger, 1 to 16 Mbytes depending upon the system implementation. The XP-1000 workstation, for example, has a 4Mbyte L2 cache. If needed data or instructions are not found in L2 cache, then the processor looks in primary memory as governed by the mapping of the virtual (program) effective address to physical effective address. The data or instructions are returned to the caches and CPU if available within a physical memory page. If the page is non-resident, it must be brought into primary memory from backing store, the data returned, and the program restarted.
Every transition between levels costs time and performance. Thus, it is to the program's benefit to keep instructions and data ready for access in the fastest part of the memory hierarchy. ProfileMe provides event types to assist the location of memory system performance problems. We'll discuss these events starting with primary memory and working in to the central processor.
A page fault occurs when a memory access needs data or instructions from a virtual memory page and the page is not resident in memory. The dfaulttrap ProfileMe event indicates that the instruction caused a data stream fault because the page was inaccessible or because the virtual address was malformed (e.g., not properly sign extended.) Data stream faults are easy to locate -- just capture a ProfileMe database and use dcpiprof/dcpilist to narrow the location of dfaulttrap events to images, procedures and instructions.
To avoid data stream faults, a program should not skip from page to page and should stay within a virtual memory page as long as possible. Page faults have a very high penalty since the program must be suspended until the needed page can be read from disk. This is a delay of many millions of processor cycles.
The Level 2 (L2) cache provides a large, relatively fast cache memory between primary memory and the on-chip L1 cache. The L2 cache is sometimes called the "B-cache" or "board-level cache." Future Alpha processors may implement the L2 cache on-chip, so we prefer the term "L2" when referring to the second level cache. As mentioned earlier, the L2 cache contains both instructions and data. L2 cache lines and items are referenced by physical address. The L2 cache is direct mapped, meaning that a given cache line holds information for exactly one address-to-block association (64 bytes per block) and that a line is quickly selected/indexed by a subfield within a pending address. If two data or instruction streams contend for the same L2 cache line(s) within a critical inner loop, the cache will thrash as the older entry is continually over-written by information from the newer access. The large size of the L2 cache reduces the probability of thrashing, but the possibility does exist. Of course, thrashing results in poor program performance.
We will use a simple program, walk, to illustrate misses in the memory system, including the L2 cache. Here is the source code for the program walk, which walks through a large array and sums the elements of the array. The array walking loop is executed repeatedly to increase the execution time of the program and to capture a larger number of miss events. We ran the program and collected data under Tru64 on an XP-1000 workstation. The results are shown in the examples below.

#define ARRAY_SIZE (16*1024*1024)
#define LOOP_COUNT (256)

long int array[ARRAY_SIZE] ;

main()
{
    int i, j ;
    long int sum ;

    sum = 0 ;
    for (i = 0 ; i < LOOP_COUNT ; i++) {
        for (j = 0 ; j < ARRAY_SIZE ; j++) {
            sum = sum + array[j] ;
        }
    }
}
DCPI/ProfileMe provides four predefined counter configurations that can be used to measure four performance-related events: number of retires, inflight cycles, B-cache misses, and memory system replay traps. Each counter configuration measures two of these four event types. The argument to the -event option in the dcpid command selects the counter configuration as well as enabling the collection of ProfileMe instruction events. The default configuration (-pm) counts inflight cycles and retire delay, so we used the configuration -pm2 to count retires and B-cache misses.
After collecting data, we used dcpiprof to break down B-cache misses by image and procedure.

14 > dcpid -event pm2 $DCPIDB ; walk ; dcpiquit
15 > dcpiprof -i -pm ret:bcmisses
...
  retired
:bcmisses       %    cum% image
   369463  99.58%  99.58% walk
     1249   0.34%  99.92% /vmunix
      268   0.07%  99.99% /dsk0h/dcpidb/PALcode
       11   0.00%  99.99% /usr/shlib/libpthread.so
        7   0.00%  99.99% unknown@positron
        7   0.00%  99.99% /usr/shlib/X11/libdix.so
        5   0.00% 100.00% /usr/shlib/X11/libos.so
        5   0.00% 100.00% /usr/shlib/libc.so
        4   0.00% 100.00% /usr/shlib/libXt.so
        3   0.00% 100.00% /usr/sbin/xntpd
        3   0.00% 100.00% /usr/bin/dcpid
        1   0.00% 100.00% /sbin/loader
16 > dcpiprof -pm ret:bcmisses walk
...
  retired
:bcmisses       %    cum% procedure image
   369463 100.00% 100.00% main      walk

The B-cache misses have occurred in the procedure main, as expected, since it contains the inner program loop.
As usual, we used dcpilist to display an instruction-by-instruction breakdown of events. In this case, we invoked dcpilist on main in the program image walk. The inner loop is clearly visible where retire and B-cache miss events have accumulated. The 21264 B-cache miss counter is a "free running" counter and does not associate a B-cache miss with a particular instruction via instruction sampling. This counter behavior, plus the rather short, memory-intensive nature of the loop, smears the B-cache miss events across the instructions in the loop. This makes the detective work of miss analysis a little more difficult.

39 > dcpilist -pm ret+ret:bcmisses main walk
main:
retired   retired
 :count :bcmisses   freq
      0         0      0 0x120001150   ldah     gp, 8192(t12)
      0         0      0 0x120001154   ldq_u    zero, 0(sp)
      0         0      0 0x120001158   lda      gp, 28432(gp)
      0         0      0 0x12000115c   ldq_u    zero, 0(sp)
      0         0      0 0x120001160   bis      zero, zero, t0
      0         0      0 0x120001164   bis      zero, zero, t1
      0         0      0 0x120001168   ldah     t2, 256(zero)
      0         0      0 0x12000116c   ldq      t3, -32624(gp)
      0         0      0 0x120001170 : bis      zero, zero, t4
      0         0      0 0x120001174   bis      zero, t3, t5
      0         0      0 0x120001178   ldq_u    zero, 0(sp)
      0         0      0 0x12000117c   ldq_u    zero, 0(sp)
  67561     58616  67759 0x120001180 : ldq      t7, 0(t5)
  67868     58798  67759 0x120001184   addl     t4, 0x1, t4
  67912     58964  67759 0x120001188   lda      t5, 8(t5)
  67945     58938  67759 0x12000118c   cmplt    t4, t2, a0
  67760     67200  67759 0x120001190   addq     t0, t7, t0
  67510     66947  67759 0x120001194   bne      a0, 0x120001180
      0         0      0 0x120001198   addl     t1, 0x1, t1
      0         0      0 0x12000119c   lda      a1, -256(t1)
      0         0      0 0x1200011a0   blt      a1, 0x120001170
      0         0      0 0x1200011a4   bis      zero, zero, v0
      0         0      0 0x1200011a8   ret      zero, (ra), 1
      0         0      0 0x1200011ac   call_pal halt
Good cache behavior is essential to good program performance. A program should work within a small neighborhood of a few data structures, complete its processing there, and then move on to the next neighborhood. Memory data is more likely to be available in cache and to be reused if it was used recently or if its immediate neighbor within the same cache block was used recently. Similar principles hold for program instructions.
When walking an array, use a single stride, sequential access pattern. This kind of behavior exhibits good spatial and temporal locality. C stores array elements in row major order while FORTRAN lays out array elements in column major order. In C, the two-dimensional array:

a00 a01 a02
a10 a11 a12
a20 a21 a22

is arranged in memory in the following way:

a00 a01 a02 a10 a11 a12 a20 a21 a22

from lowest to highest address. The rightmost index should change fastest to obtain sequential access. In FORTRAN, the same array has the arrangement:

a00 a10 a20 a01 a11 a21 a02 a12 a22

again, from lowest to highest address. Changing the leftmost index fastest yields sequential access. Programmers who are porting software from FORTRAN to C should interchange indices where possible.
We'll discuss and analyze cache behavior again in later sections. However, before leaving the discussion of the L2 cache, we must note that data prefetching is an important technique for improving cache memory performance. Prefetching starts the transfer of data into cache before the data is needed for computation. Section 5.4.4 illustrates the use and benefits of data prefetching by inserting prefetches into the example program walk. Prefetches improve the performance of both the L1 and L2 caches.
The off-chip memory system uses physical addresses. The Alpha 21264 maps virtual (program or logical) addresses to physical addresses before communicating off-chip memory requests. Unless the operating system exploits contiguous page mapping, virtual memory is subdivided into 8Kbyte pages. The virtual memory pages belonging to a process (program) are described in a page table. Part of this description is where the physical representation of each page can be found in primary memory or on the backing store. This information controls the mapping of virtual addresses to physical, primary memory addresses.
The 21264 uses a hardware structure called a "translation look-aside buffer" to assist the mapping of virtual addresses. The 21264 has two look-aside buffers: the Data Translation Buffer (DTB) and the Instruction Translation Buffer (ITB.) The DTB and ITB each contain 128 entries, where each entry describes the mapping of one virtual memory page. The DTB and ITB are, effectively, cache memories that hold the most recently used address translation information. The operating system loads the DTB and ITB with information from the page table of the running process.
Since the DTB and ITB have 128 entries and pages are 8Kbytes in size, a full TLB can describe at most 1Mbyte of the entire virtual memory space. When translation information for a virtual address is not found in the appropriate TLB, the operating system is called to put the needed information into the TLB, overwriting one of the entries. This call to the operating system is time consuming and frequent TLB misses should be avoided.
ProfileMe provides a way to measure DTB and ITB misses. The excerpt below captures ProfileMe data for walk, displays an image-by-image summary of DTB misses, and identifies the procedure main in walk as the location of the DTB misses. A similar process can be used to isolate ITB misses using the ProfileMe itbmiss trap name instead of dtbmiss.

11 > dcpid -event pm $DCPIDB ; walk ; dcpiquit
dcpid: load : loaded pcount driver
dcpid: monitoring pm
dcpid: logging to /dsk0h/dcpidb/dcpid-positron.log
dcpictl: quit successful
12 > dcpiprof -i -pm dtbmiss
...
dtbmiss
 :count       %    cum% image
     58  87.88%  87.88% walk
      5   7.58%  95.45% /vmunix
      2   3.03%  98.48% /dsk0h/dcpidb/PALcode
      1   1.52% 100.00% /usr/dt/bin/dtterm
13 > dcpiprof -pm dtbmiss walk
...
dtbmiss
 :count       %    cum% procedure image
     58 100.00% 100.00% main      walk

The next excerpt shows the annotated disassembly for main. The DTB miss events accumulate on the load instruction at 0x120001180 that reads the value of array[j] from memory.

14 > dcpilist -pm retired+dtbmiss main walk
main:
retired dtbmiss
 :count  :count   freq
      0       0      0 0x120001150   ldah     gp, 8192(t12)
      0       0      0 0x120001154   ldq_u    zero, 0(sp)
      0       0      0 0x120001158   lda      gp, 28432(gp)
      0       0      0 0x12000115c   ldq_u    zero, 0(sp)
      0       0      0 0x120001160   bis      zero, zero, t0
      0       0      0 0x120001164   bis      zero, zero, t1
      0       0      0 0x120001168   ldah     t2, 256(zero)
      0       0      0 0x12000116c   ldq      t3, -32624(gp)
      0       0      0 0x120001170 : bis      zero, zero, t4
      0       0      0 0x120001174   bis      zero, t3, t5
      0       0      0 0x120001178   ldq_u    zero, 0(sp)
      0       0      0 0x12000117c   ldq_u    zero, 0(sp)
  67840      58  67658 0x120001180 : ldq      t7, 0(t5)
  67749       0  67658 0x120001184   addl     t4, 0x1, t4
  67802       0  67658 0x120001188   lda      t5, 8(t5)
  67500       0  67658 0x12000118c   cmplt    t4, t2, a0
  67611       0  67658 0x120001190   addq     t0, t7, t0
  67450       0  67658 0x120001194   bne      a0, 0x120001180
      0       0      0 0x120001198   addl     t1, 0x1, t1
      0       0      0 0x12000119c   lda      a1, -256(t1)
      0       0      0 0x1200011a0   blt      a1, 0x120001170
      0       0      0 0x1200011a4   bis      zero, zero, v0
      0       0      0 0x1200011a8   ret      zero, (ra), 1
      0       0      0 0x1200011ac   call_pal halt
The principles for reducing DTB misses and achieving better performance are similar to those for virtual memory page faults and caches. Access should be confined to a small section of memory until work is complete and then the program should move on to the next section. For data, this means planning array access carefully. Access into large data heaps should also be managed to avoid randomly skipping around in a heap causing frequent replacement of DTB entries. For instructions, procedures that call each other frequently should be on the same page or at least nearby in memory. Infrequently used procedures, like error handlers, should be located further away in memory to make room for more frequently called procedures.
The 21264 L1 data cache (D-cache) is on-chip and provides fast access to program data when the data is resident in the cache. Integer and floating point load instructions produce register results in 3 and 4 processor cycles, respectively. The D-cache holds 64Kbytes of memory data and is organized as a two-way set associative cache with 64 byte blocks. This means that the D-cache holds two address-to-block associations and that it can tolerate two data streams accessing addresses that map to the same cache index without replacing one of the address-to-block associations. The direct map approach in the L2 cache does not have this tolerance although the larger size of the L2 cache tends to mitigate the effect of line conflicts.
ProfileMe does not provide an event to directly measure D-cache misses. Instead, we can use retire delay to detect and infer the presence of D-cache misses and other memory system delays. Retire delay is a lower bound on the number of cycles that an instruction has delayed its retirement and the retirement of other instructions following it in program order. As we'll see in the example below, retire delay tends to accumulate on the instruction that consumes the result of a load instruction. A relatively high retire delay sometimes indicates the presence of a performance bottleneck.
Retire delay is captured in the usual way as shown below. If you have used the -pm option with dcpid, then you already have retire delay in the database and do not need to rerun dcpid.

14 > dcpid -event pm $DCPIDB ; walk ; dcpiquit
dcpid: load : loaded pcount driver
dcpid: monitoring pm
dcpid: logging to /dsk0h/dcpidb/dcpid-positron.log
dcpictl: quit successful

The -pm option uses the default ProfileMe counter configuration. This configuration also captures inflight, which is approximately the number of cycles that an instruction was inflight. The dcpiprof program will display an image-by-image summary of retire delay and will also break down retire delay by procedure.

15 > dcpiprof -i -pm ret:retdelay
...
  retired
:retdelay       %    cum% image
  1215336  97.98%  97.98% walk
    18548   1.50%  99.47% /vmunix
     6086   0.49%  99.96% /dsk0h/dcpidb/PALcode
      250   0.02%  99.98% /usr/bin/dcpid
       74   0.01%  99.99% /usr/shlib/libc.so
       49   0.00%  99.99% /usr/shlib/libXt.so
       40   0.00% 100.00% /usr/shlib/libX11.so
       28   0.00% 100.00% /usr/shlib/libpthread.so
        6   0.00% 100.00% /var/subsys/pcount.mod
        5   0.00% 100.00% /usr/shlib/libmach.so
        5   0.00% 100.00% /usr/shlib/X11/libos.so
        5   0.00% 100.00% unknown@positron
        4   0.00% 100.00% /usr/sbin/routed
        3   0.00% 100.00% /usr/shlib/X11/libdix.so
        2   0.00% 100.00% /usr/shlib/libevm.so
        1   0.00% 100.00% /usr/shlib/libcfg.so
16 > dcpiprof -pm ret:retdelay walk
...
  retired
:retdelay       %    cum% procedure image
  1215336 100.00% 100.00% main      walk

Although these breakdowns identify the program images and procedures with the most retire delay, retire delay is more meaningful at the instruction level. The following example shows the disassembly of the procedure main in walk.
17 > dcpilist -pm ret+ret:retdelay main walk
main:
retired   retired
 :count :retdelay   freq
      0         0      0 0x120001150   ldah     gp, 8192(t12)
      0         0      0 0x120001154   ldq_u    zero, 0(sp)
      0         0      0 0x120001158   lda      gp, 28432(gp)
      0         0      0 0x12000115c   ldq_u    zero, 0(sp)
      0         0      0 0x120001160   bis      zero, zero, t0
      0         0      0 0x120001164   bis      zero, zero, t1
      0         0      0 0x120001168   ldah     t2, 256(zero)
      0         0      0 0x12000116c   ldq      t3, -32624(gp)
      0         0      0 0x120001170 : bis      zero, zero, t4
      0         0      0 0x120001174   bis      zero, t3, t5
      0         0      0 0x120001178   ldq_u    zero, 0(sp)
      0         0      0 0x12000117c   ldq_u    zero, 0(sp)
  67840      3590  67658 0x120001180 : ldq      t7, 0(t5)
  67749         0  67658 0x120001184   addl     t4, 0x1, t4
  67802         0  67658 0x120001188   lda      t5, 8(t5)
  67500         0  67658 0x12000118c   cmplt    t4, t2, a0
  67611   1144229  67658 0x120001190   addq     t0, t7, t0
  67450     67517  67658 0x120001194   bne      a0, 0x120001180
      0         0      0 0x120001198   addl     t1, 0x1, t1
      0         0      0 0x12000119c   lda      a1, -256(t1)
      0         0      0 0x1200011a0   blt      a1, 0x120001170
      0         0      0 0x1200011a4   bis      zero, zero, v0
      0         0      0 0x1200011a8   ret      zero, (ra), 1
      0         0      0 0x1200011ac   call_pal halt

The load (ldq) instruction at 0x120001180 reads the array element array[j] from memory. After the value is read, it is stored in register t7. Although memory system misses are caused by the load, the retire delay accumulates on the add instruction (addq) that consumes the result of the load in register t7. Thus, the general procedure for analyzing misses is to identify the instruction with the highest retire delay and then work back in the program to find any load instructions that produce the register arguments to the delayed instruction. The load instructions are the likely culprits.
As an experiment, let's manually insert a prefetch instruction into the inner loop of walk to improve D-cache behavior. The Compaq C compiler supports a built-in function, asm, which can be used to insert assembler language instructions into a program. Here is the source for the modified program, pwalk.

#include <c_asm.h>

#define ARRAY_SIZE (16*1024*1024)
#define LOOP_COUNT (256)
#define PREFETCH 8

long int array[ARRAY_SIZE+PREFETCH] ;

main()
{
    int i, j ;
    long int sum ;

    sum = 0 ;
    for (i = 0 ; i < LOOP_COUNT ; i++) {
        for (j = 0 ; j < ARRAY_SIZE ; j++) {
            asm("ldl $31,($16)", &array[j+PREFETCH]) ;
            sum = sum + array[j] ;
        }
    }
}

The program first includes c_asm.h to define asm. The body of the loop remains the same except that a call to asm:

asm("ldl $31,($16)", &array[j+PREFETCH]) ;

has been inserted. The call passes in the address of the array element that is one cache line ahead. The address is made available to the prefetch instruction in register R16 ($16.) Thus, the prefetch instruction will begin the process of accessing the array elements to be used from the next cache line while the array elements in the current cache line are being read and summed.
Write hints can also be inserted into a program using asm:

asm(".arch ev6;wh64 ($16)", address)

A write hint tells the memory system that a cache block will be completely overwritten. The memory system clears the entry for this block in the cache without reading its data from memory. For this reason, write hints must be used carefully; their behavior is destructive.
We used the Tru64 time command to measure the execution time of the original walk program (no prefetch) and pwalk (prefetch inserted.) The walk program executes in 49.68 seconds and pwalk executes in 36.71 seconds. We used DCPI/ProfileMe to measure the total number of cycle samples within the inner loops of walk and pwalk.

Measurement                    walk              pwalk
Execution time                 117.44 seconds    58.93 seconds
Total cycles in inner loop     1,223,128         615,265

Measurements of walk and pwalk.

The ratio of inner loop cycles (1,223,128 / 615,265 = 1.98) is roughly the same as the ratio of program execution times (117.44 / 58.93 = 1.99.)
The ProfileMe measure inflight is approximately the number of cycles that an instruction is inflight. We measured and compared the inflight cycles for the inner loop of walk:

retired     valid
 :count :inflight   freq
...
  67840   8085612  67658 0x120001180 : ldq    t7, 0(t5)
  67749   8078173  67658 0x120001184   addl   t4, 0x1, t4
  67802   8098298  67658 0x120001188   lda    t5, 8(t5)
  67500   8037100  67658 0x12000118c   cmplt  t4, t2, a0
  67611   9160497  67658 0x120001190   addq   t0, t7, t0
  67450   9204279  67658 0x120001194   bne    a0, 0x120001180

and the inner loop of pwalk:

  68161   4296917  67735 0x120001190 : ldl    zero, 0(t5)
  67484   4252675  67735 0x120001194   ldq    t7, -64(t5)
  67990   4284110  67735 0x120001198   addl   t4, 0x1, t4
  67698   4263218  67735 0x12000119c   lda    t5, 8(t5)
  67578   3760276  67735 0x1200011a0   cmplt  t4, t3, a0
  67579   4299603  67735 0x1200011a4   addq   t0, t7, t0
  67658   4374610  67735 0x1200011a8   bne    a0, 0x120001190

The number of inflight cycles for each instruction has been dramatically reduced in pwalk. Retire delay is also reduced.
As in the case of virtual memory pages and B-cache behavior, D-cache performance is best when a program works within a small neighborhood of memory (32Kbytes or less on the Alpha 21264), completes processing within the neighborhood, and then moves on to the next one. Prefetching is a way to tell the memory system that a certain block of memory will be used next and that it should be brought into the D-cache. This will make that block of memory available in D-cache when it is actually read or written.
The L1 instruction cache (I-cache) provides quick access to the most recently used program instructions. The CPU looks here first for instructions before requesting them from the L2 cache, or if necessary, primary memory. The 21264 needs a steady, predictable stream of instructions to deliver high performance. The L1 I-cache quickly delivers instructions to the 21264 fetch unit in order to keep the rest of the processor pipeline busy. Like D-cache, performance suffers when I-cache miss rates are high.
The ProfileMe event nyp (Not Yet Prefetched) indicates that when the 21264 fetch unit requested a block containing an instruction, the instruction was not in the I-cache and an off-chip request for the instruction had not yet been initiated. If this event is asserted, then the instruction fetch operation definitely caused an I-cache miss stall. There are conditions under which an I-cache miss can still occur and nyp will not be asserted. Thus, nyp tends to understate the occurrence of I-cache miss stalls.
Generally, I-cache behavior is best when the most frequently executed repetitive code -- such as an inner loop or a group of closely interrelated subroutines -- is fully contained in the instruction cache. This minimizes the likelihood of I-cache collisions between critical stretches of code. Code placement from a high level language is indirect at best, but interrelated, frequently executed functions can be grouped together in a source module, since the compiler will lay the functions down in order in memory. The two-way set associative organization of the Alpha 21264 I-cache can tolerate two-way conflicts. Earlier Alpha implementations with a direct mapped I-cache are more sensitive to collisions.
I-cache behavior is also sensitive to procedure in-lining. Benefit can be gained by expanding in-line a frequently called procedure, such as a function that is called from the body of a critical inner loop. Procedure in-lining reduces call overhead, reduces branch mispredicts, and increases opportunities for optimization between the call site and the body of the procedure itself. However, expansion often increases overall code size, perhaps making a critical inner loop so big that it no longer fits in the I-cache, resulting in poor I-cache behavior and performance. Thus, in-lining should be used with caution -- measure test runs with and without procedure in-lining to see which is best.
The Alpha instruction set architecture (ISA) favors the use of aligned memory items. An aligned data item can be read or written with a single load or store instruction. Unaligned memory accesses usually require several instructions to update a data item that straddles an aligned address boundary.
C compilers for Alpha assume aligned memory access. The compiler aligns data structures and generates small, fast code. Sometimes, however, programmers make bad assumptions about data alignment, type casting or pointer arithmetic. Or perhaps a program was written for a specific (non-Alpha) architecture and it embodies architecture-specific assumptions about data alignment. When run on Alpha, the program may use pointer/address values that are not aligned according to the Alpha ISA. These unaligned memory accesses cause alignment traps, which are directed to the operating system. The operating system fixes up the instruction, completes the memory read or write operation, and returns control to the program. These "alignment fix-ups" are costly and can rob program performance.
We will illustrate the effect of unaligned memory accesses and traps with the simple program below (unalign.) The program deliberately causes alignment traps by computing an unaligned address and then reading a 32-bit word at that address.

#include <stdio.h>

unsigned char two_64bit_words[16] ;

unsigned int f(unsigned int *pInt)
{
    return( *pInt ) ;
}

main()
{
    int i ;
    unsigned char *pChar ;
    unsigned int *pInt ;
    unsigned int vInt ;

    pChar = two_64bit_words ;
    for (i = 0 ; i < 8 ; i++) two_64bit_words[i] = 0xFF ;
    for (i = 8 ; i < 16 ; i++) two_64bit_words[i] = 0x0 ;

    for (i = 0 ; i < 80000000 ; i++) {
        pInt = (unsigned int *)(pChar+6) ;
        vInt = f(pInt) ;
    }

    printf("pInt is %8x\n", pInt) ;
    printf("vInt is %8x\n", vInt) ;
}

The function f takes a pointer to an unsigned integer and returns the value of the integer. When f is compiled, it produces a single instruction (ldl) to read the value of the unsigned, 32-bit integer from memory (alignment assumed.) The array two_64bit_words is allocated on a quadword aligned address boundary. The quadword aligned base address of the array is stored in the pointer variable pChar. The function main calls f with a pointer into two_64bit_words that has been forced to an unaligned address:

pInt = (unsigned int *)(pChar+6) ;

The address arithmetic and cast to a different pointer type should start alarm bells ringing, since this assignment effectively goes around C language type enforcement. We chose an offset of six bytes to force the unsigned integer across a word aligned address as illustrated below. The "ABCD" markers denote the four bytes in the unsigned integer to be read. Aligned 32-bit integer boundaries are at indices 0, 4, 8 and 12, where an aligned access could be performed.
 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
                    A  B  C  D
The session excerpt below runs the program unalign and captures ProfileMe data. The dcpiprof commands narrow the occurrence of unaligntrap events to the function f in the program unalign.

58 > uac p noprint
59 > dcpid -event pm $DCPIDB ; unalign ; dcpiquit
dcpid: load : loaded pcount driver
dcpid: monitoring pm
dcpid: logging to /dsk0h/dcpidb/dcpid-positron.log
pInt is 400001a6
vInt is     ffff
dcpictl: quit successful
60 > dcpiprof -i -pm unaligntrap
...
unaligntrap
     :count       %    cum% image
       1362 100.00% 100.00% unalign
61 > dcpiprof -pm unaligntrap unalign
...
unaligntrap
     :count       %    cum% procedure image
       1362 100.00% 100.00% f         unalign

The following dcpilist excerpt displays an annotated disassembly for f showing the number of retire events and unaligntrap events.
62 > dcpilist -pm ret+unaligntrap f unalign
f:
retired unaligntrap
 :count      :count   freq
      0        1362    615 0x1200011b0   ldl     v0, 0(a0)
   1230           0    615 0x1200011b4   ret     zero, (ra), 1
      0           0      0 0x1200011b8   ldq_u   zero, 0(sp)
      0           0      0 0x1200011bc   ldq_u   zero, 0(sp)

The load instruction (ldl) reads the 32-bit integer at the memory location specified by the address in register a0, which holds the pointer argument to the function f. The load instruction is the site of many unaligned address traps since we forced the pointer to be an unaligned address. None of the load instructions retired because the operating system was called to fix up and complete the trapped instruction on behalf of the hardware and the program.
Hint: If you try this example on Tru64 UNIX, be sure to enter the uac p noprint command. By default, Tru64 does not let alignment faults go by silently. This command disables alignment error messages; without it, your display screen will fill with error messages. See the man page on uac for more details.
The Tru64 time command shows that unalign consumes 6.32 execution seconds of user (program) time and 29.35 seconds of operating system time:

63 > time unalign
pInt is 400001a6
vInt is     ffff
6.32u 29.35s 0:35 99% 0+1k 0+1io 0pf+0w

The program is spending over three times as much time in the operating system as it is spending in the program itself. The inefficiency is even worse than it appears because time in the operating system PALcode is accumulated in user time. The image-by-image summary of retired instructions tells the tale:

64 > dcpiprof -i -pm retire
...
retired
 :count      %   cum% image
 310682 72.57% 72.57% /vmunix
 108538 25.35% 97.92% /dsk0h/dcpidb/PALcode
   8754  2.04% 99.96% unalign
     86  0.02% 99.98% /usr/bin/dcpid

Only 8,754 application instructions were retired while nearly 419,220 instructions were spent on alignment fix-ups.
The Compaq and GNU C compilers can be directed to assume unaligned data and to generate conservative Alpha code that will not cause alignment traps. This quick fix may, at first, appear attractive. However, alignment traps usually indicate a conceptual error somewhere in the program; they should be treated as a bug and properly corrected. ProfileMe identifies the offending load/store instructions, which can easily be related back to the source code for further investigation.
Effective exploitation of instruction level parallelism (ILP) depends upon the availability of internal computational resources. Resources in the Alpha 21264 pipeline include multiple integer function units, multiple floating point function units, load queue, store queue, etc. In order to size these resources, the chip architects simulated and analyzed real world programs and made trade-offs between program performance and the number and size of on-chip resources. Although the Alpha 21264 is able to support a high degree of ILP, the number and capacity of its computational resources are limited. If a particular stretch of code (temporarily) uses up a particular resource and another instruction arrives and makes its request for the resource, then that instruction must wait for the resource to become free and available again before it may proceed. Sometimes this wait results in a pipeline stall and lost processor cycles.
The 21264 uses queues to balance processing rates between pipeline stages and the execution units.
The load and store queues fill when a large number of read/write operations have been initiated and the memory unit cannot keep up with the requests. This is most likely to occur in code that is load/store intensive. Again, the solution is to thin out or reduce the density of load/store operations. Bursts of load/store activity should be separated by computation, allowing the memory unit to work off the backlog of memory requests before making new requests. Data prefetching will also help as data will be available in faster L1 D-cache and the memory unit will be able to process memory requests at a much higher rate.
The instruction queues may become full when the fetch and mapping units have produced more instructions than can issue into the integer and floating point execution units. The mapping stage stalls when an instruction queue is full since it does not have a place to put newly mapped instructions. This may be due to poor instruction scheduling and/or a lack of available function units. Certain kinds of operations must be performed in a particular function unit. For example, MIN/MAX and PACK/UNPACK MVI operations must be performed in unit zero of the upper integer subcluster; integer multiply must be performed in unit one of the upper integer subcluster; and floating point multiply must be executed in the floating point multiplier. If a large number of instructions needing a particular unit occur within a small neighborhood of the program, then contention for the unit, and waiting instructions, will result. Conceivably, if the number of waiting instructions is large enough, the instruction queue will fill. A corrective action is to "thin out" instructions with specialized needs, separating them with instructions that do not create additional contention for the function unit in demand. Unfortunately, a higher level language programmer can affect the instructions laid down by the compiler only indirectly, by rearranging source code.
The mapping stage may also stall due to a shortage of physical registers. Physical registers hold intermediate results and are allocated from a pool of integer registers and a pool of floating point registers. A physical register is allocated to an instruction when the instruction is mapped and placed into the instruction queue. The register remains allocated while the instruction is in flight and is released when the instruction is retired. Of the 80 physical integer registers, 33 are available to hold the results of instructions that are inflight. Of the 72 physical floating point registers, 37 are available to hold the results of inflight instructions. The mapping stage of the pipeline will stall when no physical register of the needed kind is free for the instruction being mapped.
See the Compiler Writer's Guide for the Alpha 21264 for details on register availability and allocation.
The mapping stage has one other dynamically allocated resource: the instruction number or "inum." An instruction number is assigned to each instruction and is used for bookkeeping during execution. There are 80 inums available. The mapping stage may stall due to a shortage of instruction numbers, but in practice, this is unlikely.
The ProfileMe map_stall event is asserted when an instruction stalls after it was fetched and before it was mapped. Map stalls can be isolated to particular images and procedures in the usual way (dcpiprof and dcpilist).
Stalls at the mapping stage can be reduced through better code scheduling to minimize the amount of time that instructions remain inflight. Universal no-op instructions (UNOP) can be inserted into the code to reduce fetch bandwidth, thereby slowing down the pipeline stages immediately preceding the mapper. Unfortunately, a programmer working in a higher level language can only indirectly affect the code schedule by rearranging the source program.
The Alpha 21264 and ProfileMe support other event types and trap bits. This section provides a brief description of them.
"Floobydust" is a contemporary term derived from archaic Latin miscellaneous, whose disputed history probably springs from Greek origins (influenced, of course, by Egyptian linguists) -- meaning here "a mixed bag." From Audio Handbook, © 1976 by National Semiconductor Corporation.
The early_kill and late_kill events, when asserted, specify that an instruction was killed early in the pipeline (before issue) or late in the pipeline (after issue). Early and late kill give some indication of the cost associated with a killed instruction. Instructions killed after issue have already consumed resources and may have been partially executed. Thus, late kills are more expensive than early kills.
The trap bits include exception conditions that occur very infrequently, or not at all in a well-behaved program. These trap bits are:
arithtrap: The instruction caused an arithmetic trap.

fpdisabledtrap: An attempt was made to execute a floating point instruction while floating point instructions were disabled.

MT_FPCRtrap: The floating point control register (FPCR) is emulated on the Alpha 21264. This trap is taken when execution of an MT_FPCR (move to FPCR) instruction is attempted.

iacvtrap: The instruction caused an I-stream access violation trap. An illegal access to the instruction was attempted as specified by the current processor mode and page protection.

OPCDECtrap: An attempt was made to execute a reserved instruction or to execute a privileged instruction in user mode.

interrupt: The instruction was preempted by an interrupt.

mchktrap: The instruction caused a machine check trap. Machine checks occur when the hardware detects a hardware error condition.
With the exception of interrupt, these bits indicate the presence of a serious error condition or a bug.
We provided a brief introduction to the use of DCPI/ProfileMe to identify performance problems on Alpha, concentrating on the Alpha 21264A processor. ProfileMe and the 21264A use a new approach to performance analysis called "instruction sampling." Although instruction sampling differs from program counter sampling, it provides precise information about instruction-level program behavior.
Internal notes and presentations by Mark Vandevoorde, Mike Burrows, Kip Walker, Steve Root, Jeff Dean, Jamey Hicks, Mark Davis, Michael Bedy and Phil Ezolt were essential to the preparation of this paper. Thanks!
Special thanks go to Sharon Smith whose comments vastly improved this paper, especially the introduction. Thanks also go to Mark Davis for his careful review and for contributing replay examples. His contributions made the replay section much stronger.
The first three papers below describe the inner workings and application of ProfileMe and DCPI. The fourth paper is an analysis of the Compaq ES40 system, which is based on the 21264; ProfileMe was used in that analysis.
ProfileMe: hardware support for instruction-level profiling on out-of-order processors, Dean, J., Hicks, J.E., Waldspurger, C.A., Weihl, W.E., and Chrysos, G., Proceedings Thirtieth Annual IEEE/ACM International Symposium on Microarchitecture, December 1997, pages 292 - 302.
Continuous Profiling: Where have all the cycles gone?, J. Anderson, L.M. Berc, J. Dean, S. Ghemawat, M.R. Henzinger, S.-T. Leung, R.L. Sites, M.T. Vandevoorde, C.A. Waldspurger, and W.E. Weihl, SRC Technical Note 1997-016a, Also published in Proceedings of the 16th Symposium on Operating System Principles and ACM Transactions on Computer Systems, Volume 15, Number 4, November 1997, pages 357-390.
Transparent, Low-Overhead Profiling on Modern Processors, Workshop on Profile and Feedback-Directed Compilation, Paris, France, 13 October 1998.
Performance analysis of the Alpha 21264-based Compaq ES40 system, Cvetanovic, Z. and Kessler, R.E., Proceedings of the 27th International Symposium on Computer Architecture, 2000, June 10-15 2000, pages 192 - 202.
The following papers are "higher-level" introductions to the Alpha 21264 processor. They provide more detail on the inner workings of the 21264 chip.
The Alpha 21264 microprocessor, Kessler, R.E., IEEE Micro, March-April 1999, pages 24 - 36.
The Alpha 21264 microprocessor architecture, Kessler, R.E., McLellan, E.J., and Webb, D.A., Proceedings. International Conference on Computer Design: VLSI in Computers and Processors, 1998. ICCD '98, October 5-7, 1998, pages 90 - 95.
If you would like to build a 21264, here's where to start. These papers are mainly of interest to logic/circuit designers.
The Alpha 21264: a 500 MHz out-of-order execution microprocessor, Leibholz, D. and Razdan, R.. Proceedings of IEEE Compcon '97., February 23-26, 1997, pages 28 - 36.
A 600 MHz superscalar RISC microprocessor with out-of-order execution, Gieseke, B.A., Allmon, R.L., Bailey, D.W., Benschneider, B.J., Britton, S.M., Clouser, J.D., Fair, H.R., III, Farrell, J.A., Gowan, M.K., Houghton, C.L., Keller, J.B., Lee, T.H., Leibholz, D.L., Lowell, S.C., Matson, M.D., Matthew, R.J., Peng, V., Quinn, M.D., Priore, D.A., Smith, M.J., and Wilcox, K.E., Digest of Technical Papers: IEEE International Solid-State Circuits Conference, 1997. 43rd ISSCC., 1997, pages 176 - 177, 451.
High-performance microprocessor design, Gronowski, P.E., Bowhill, W.J., Preston, R.P., Gowan, M.K., and Allmon, R.L., IEEE Journal of Solid-State Circuits, May 1998, pages 676 - 686.
Designing High Performance Microprocessors, Gronowski, P., Digest of Technical Papers: Symposium on VLSI Circuits, June 1997, pages 51 - 54.
Design tradeoffs in stall-control circuits for 600 MHz instruction queues, Fischer, T. and Leibholz, D., Digest of Technical Papers. 1998 IEEE International, Solid-State Circuits Conference, 1998, February 5-7, 1998, pages 232 - 233, 442.
Issue logic for a 600-MHz out-of-order execution microprocessor, Farrell, J.A. and Fischer, T.C., IEEE Journal of Solid-State Circuits, May 1998, Volume: 33 Issue: 5, pages 707 - 712.
Timing verification of the 21264: A 600 MHz full-custom microprocessor, Shriver, E.J., Hall, D.H., Nassif, N., Rahman, N.E., Rethman, N.L., Watt, G., and Farrell, J.A., Proceedings of the International Conference on Computer Design: VLSI in Computers and Processors, 1998, ICCD'98, October 1998, pages 96 - 103.
Circuit implementation of a 600 MHz superscalar RISC microprocessor, Matson, M., Bailey, D., Bell, S., Biro, L., Butler, S., Clouser, J., Farrell, J., Gowan, M., Priore, D., Wilcox, K., Proceedings of the International Conference on Computer Design: VLSI in Computers and Processors, 1998, ICCD'98, October 1998, pages 104 - 110.
A 600-MHz superscalar floating-point processor, Clouser, J., Matson, M., Badeau, R., Dupcak, R., Samudrala, S., Allmon, R. and Fairbanks, N. IEEE Journal of Solid-State Circuits, July 1999, Volume 34, Number 7, pages 1026 - 1029.
A 600 MHz CMOS PLL microprocessor clock generator with a 1.2 GHz VCO, von Kaenel, V., Aebischer, D., van Dongen, R., and Piguet, C., Digest of Technical Papers. 1998 IEEE International Solid-State Circuits Conference, February 5-7, 1998, pages 396 - 397.
Clocking design and analysis for a 600 MHz Alpha microprocessor, Fair, H. and Bailey, D., Digest of Technical Papers. 1998 IEEE International Solid-State Circuits Conference, February 5-7, 1998, pages 398 - 399.
Additional tips and information about attaining higher performance on Alpha may be found in the resources below.
Performance tips for Alpha Linux C programmers, Compaq Computer Corporation, December 2000.
Compaq Computer Corporation, Compiler Writer's Guide for the Alpha 21264, Order number EC-RJ66A-TE, June 1999.
Compaq Computer Corporation, Tru64 UNIX Programmer's Guide, August 1999.
Tip #1 Use the Compaq C compiler. It has been specially tuned for Alpha.
Tip #2 Identify the Alpha implementation in your machine. Tell the compiler to produce code for that particular implementation.
Tip #3 Use hand-tuned library functions wherever possible.
Tip #4 Ask the compiler to unroll loops.
Tip #5 Ask the compiler to generate procedures in line.
Tip #6 Use C constructs that encourage the compiler to generate conditional move instructions instead of branches.
Tip #7 Use local variables to help the compiler analyze and optimize code.
Tip #8 Consider data cache size when sizing and stepping through arrays.
Tip #9 Use aligned data structures and data access.
Tip #10 Detect, identify and remove performance robbing alignment faults.
Tip #11 Help the compiler to keep data items as close to the processor as possible.
Tip #12 Help the compiler exploit software pipelining by keeping inner loop code straightforward and by keeping dependencies simple.
Tip #13 Design and code for good temporal and spatial locality.
Tip #14 Avoid array sizes and strides that are multiples of the cache size.
Tip #15 Use data structures and code organization that are "virtual memory ready."
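Tip #6 is the most code-shaped item in the list above. A hedged sketch of the idea (the function names are made up for illustration, and whether the compiler actually emits conditional moves must be verified by inspecting the generated code):

```c
/* Branchy form: a potentially hard-to-predict branch per call. */
int max_branch(int a, int b)
{
    if (a > b)
        return a;
    return b;
}

/* Branchless form: the conditional expression maps naturally onto a
 * conditional move (CMOVxx on Alpha), avoiding branch mispredicts. */
int max_cmov(int a, int b)
{
    return (a > b) ? a : b;
}

/* Counting elements above a threshold without a branch in the loop
 * body: the comparison result is 0 or 1 and is simply accumulated. */
int count_above(const int *v, int n, int threshold)
{
    int count = 0;
    for (int i = 0; i < n; i++)
        count += (v[i] > threshold);
    return count;
}
```

The conditional-expression style only pays off when the branch it replaces is poorly predicted; for highly predictable branches the branchy form can be just as fast.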
There are four key elements in DCPI: the DCPI driver (pcount), the DCPI daemon (dcpid), the DCPI database and the analysis programs.
The driver and the DCPI daemon work together to collect information about a running application. The driver is a module in the Tru64 kernel that samples the hardware program counters. This information is, in turn, passed to the DCPI daemon. As the volume of collected data grows, the DCPI daemon writes it to the DCPI database. The DCPI database is the link between the DCPI daemon and the analysis programs. After collection is completed, the analysis programs read data from the database, analyze it, and prepare results.
The DCPI daemon must be run by the superuser. It is possible to run the DCPI daemon using setuid-root. However, this method is not secure and it is not recommended.
The DCPI installation script is interactive. If so directed, the
installation script will create a database for profile data,
will generate an initial scan map, and will put the scan map into
the database directory. The user must specify the path to the database
directory. The initial scan map file, default_scanmap
,
is a list of program and library images. When the installation
script creates the scan map, it traverses the file system looking
for program and library images to add to the list. Usually, this
process takes several minutes, but it may take longer to
complete if the file system has a large number of application
packages installed. The scan map can be recreated or refreshed
using the dcpiscan
program. The scan map file can
be copied to new or different DCPI collection databases. The
scan map helps the DCPI daemon to identify binary images.
This paper assumes that a database directory was created during installation. In the examples, the database directory is located at /dsk0h/dcpidb. We set the environment variable DCPIDB to the path of the database directory, which makes it easier to specify the database directory path to the DCPI daemon, dcpid. The programs dcpiprof and dcpilist take the database directory path from the DCPIDB variable.
First, set the environment variable DCPIDB to the directory path.

35 > setenv DCPIDB /dsk0h/dcpidb

Next, start the DCPI daemon, specifying the profile event data to collect and the directory path.

36 > dcpid -event cycles $DCPIDB

The -event option specifies the kind of profile information to collect. The dcpid command starts the daemon running in the background and leaves the shell "live" for more commands. Now it's time to run a CPU intensive application like dilate, which performs morphological dilation on an image:

37 > dilate heart.ima ball out.ima

After the program runs, run dcpiquit to write out any remaining performance data in the driver/daemon buffers and stop the collection process.

38 > dcpiquit

The dcpiflush command tells the daemon to flush data without stopping collection. dcpiquit ends the current measurement epoch (period). The dcpiepoch command closes the current epoch and starts a new epoch without stopping the daemon and ending collection. This allows more flexibility during measurement. The analysis commands all provide several ways to select a particular epoch (and associated set of profile data) for analysis. See the man page for dcpiepoch for more information.
Disclaimer: The execution times cited in this technical note are
solely intended to compare the relative merits of certain coding
practices, optimizations, etc. Your mileage may vary.
Copyright © 2001 Hewlett Packard Company