50 Cards in this Set
- Front
- Back
What drives CPU speed? |
- Faster clock cycle (usually through increased transistor density) - Smaller Cycles Per Instruction (CPI) |
|
What is a control hazard? |
When the processor must stall (can't fetch next instruction) while waiting for the result of a branch instruction. |
|
What is the CPI due to pipelining? |
Ideally, 1. |
|
Why is branch misprediction a problem? (2) |
- Branch instructions are 15-25% of all instructions. - Object-oriented programming has more indirect branches, which are harder for the compiler to predict. |
|
In general, how does branch prediction work? |
Uses history to predict the future. More specifically, a global (two-level) predictor uses a global history of recent branch outcomes to index into a table of per-history predictions. |
|
What is the CPI due to branch prediction? |
Ideally, 1 (down from >1 in non-ideal case of pipelining). |
|
What is a superscalar architecture? |
A superscalar architecture executes multiple instructions at the same time on a single processor core (instruction-level parallelism) by using the multiple functional units on the core. |
|
What is the CPI due to a superscalar architecture? |
<1 |
|
What is out-of-order execution? |
Re-ordering instructions that have no control/data dependences between them, to reduce stalls from long-latency instructions. |
|
What is the CPI due to out-of-order execution? |
<<1 |
|
What are some challenges of out-of-order execution? |
- Detecting dependencies between instructions is hard.
|
What is deep pipelining? |
Simply a deeper (more stages) pipeline. Each stage does less work, so its circuitry is simpler and the clock can run faster, decreasing the cycle time (per stage). |
|
What is Simultaneous Multithreading (SMT) (a.k.a. Hyperthreading)? |
SMT is when a single core executes multiple instructions from multiple threads at the same time. |
|
What is the CPI due to SMT? |
<<<1 |
|
What are 3 advantages and 1 disadvantage of 64-bit vs. 32-bit architectures? |
Advantages: - Better performance for 64-bit (high-precision) integer math. - A much larger address space (beyond 4 GB per process). - More general-purpose registers on x86-64. Disadvantage: - Pointers are twice as large, increasing memory use and cache pressure. |
|
Why measure a program? (3) |
For a program: - To optimize it (find the bottleneck). - To compare implementations. - To find a bug (e.g., why is it so slow?). For a computer: - To compare which is better/faster. |
|
What are 3 bad ways of comparing processors and why? |
- Clock frequency: IPC for each could be very different. - CPI or IPC: Depends on the instruction set used and the compiler. - FLOPS: Depends on the instruction set used and only matters if you use floating point heavily. |
|
How do you measure a processor? |
Use wall-clock time on realistic workloads (benchmarks).
|
What is Amdahl's Law? |
The expected improvement after optimizing only a portion of a program. Simply put, if a fraction p of the execution time is sped up by a factor s, the overall speedup is 1 / ((1 - p) + p/s). |
|
How is Amdahl's Law useful? |
It allows you to optimize the common case. However, note that the common case may change after one round of optimization. |
|
What are 3 tools for profiling? |
- Software timers (C library/OS-level timers) - Hardware timers and performance counters - Instrumentation (gprof/gcov) |
|
What information does /usr/bin/time give? |
The wall-clock time (real), the time spent in user mode (user), and the time spent in kernel mode (sys).
Note that the user time is the sum of time taken on all cores (so it could be larger than wall-clock). |
|
What information does the x86 rdtsc instruction give? How can Linux programs use it? |
rdtsc (read time stamp counter) returns the current value of the core's cycle counter. Linux programs can read it via inline assembly or the __rdtsc() intrinsic (x86intrin.h on GCC/Clang). |
|
Give 2 examples of software timers. |
- /usr/bin/time - The C library times() function (sys/times.h) |
|
What are some disadvantages of hardware performance counters? |
- Require OS support. - Can overflow. - Must be sampled carefully. |
|
What is one (new) tool to easily access hardware performance counters? |
perf |
|
What are some disadvantages of instrumentation? |
The observer effect - You can't measure the system without disturbing it. Instrumentation code can slow down execution. |
|
How does gprof work? |
- Adds code to all function calls to track the caller and the number of times called; uses this to build a call graph with call counts. - Adds an interrupt handler that interrupts the program every ~10 ms; on each interrupt, it checks the current user frame to determine the running function and uses these samples to approximate how long each function ran. |
|
What are some limitations of gprof? |
- Measuring timing with periodic interrupts is relatively inaccurate by nature. - It uses the user frame to determine the currently running function and its parent; because of this, it can't work with 3rd-party binaries (note that "-pg" will link "-pg"-instrumented C libraries). |
|
How does gcov work? |
- Adds code to every basic block to count number of times each basic block is entered (and thus each line is executed). - Can be used to determine if code is executed at all. |
|
What are some limitations of gcov? |
- It provides no timing information. |
|
How does valgrind work? |
- It is a virtual machine that does just-in-time (JIT) compilation. - Adds instrumentation dynamically. - Uses this instrumentation to determine memory issues like leaks or overruns and where in the code they occurred. |
|
What are some disadvantages of valgrind? |
- Emulation makes profiled program 4-5x slower. - (Kirk-special) Doesn't detect stack overruns. |
|
What are the goals of a compiler? (5) |
- Correctness - Support for debugging - Compile for performance - Compilation is fast - Small code size |
|
What is a basic block? |
A group of consecutive instructions with a single entry point and a single exit point. |
|
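A small illustration of how a function splits into basic blocks (the block labels B1-B4 in the comments are mine):

```c
/* Each basic block runs straight-line: one entry point, one exit point.
 * Branches end a block; branch targets and merge points start one. */
int classify(int x) {
    int r;           /* B1: entry block, ends at the branch below */
    if (x > 0)
        r = 1;       /* B2: taken path */
    else
        r = -1;      /* B3: not-taken path */
    return r;        /* B4: both paths merge here */
}
```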
What are the 3 requirements of performance optimization? |
- Preserve correctness. - Improve average performance. - Be "worth the effort." |
|
What is a good formula for execution time and what does it imply about the focus of optimizations? |
Execution time = #instructions * CPI * time/cycle. Since time per cycle is largely fixed by the hardware, software optimizations focus on reducing the instruction count and the CPI (stalls). |
|
What is the role of an optimizing compiler? |
To provide an efficient mapping of the program to the machine. - Code selection and ordering - Register allocation |
|
What are 3 limitations of optimizing compilers? |
- They are constrained in an effort to maintain program behaviour. - Most analysis is only within procedures (inter-procedural is too expensive). - Most analysis is based on static information (run-time inputs are difficult for compiler to predict).
When in doubt, the compiler must be conservative. |
|
What is the role of a programmer with an optimizing compiler? |
- Don't smash code into oblivion. - Select the best algorithm. - Write readable, maintainable code. - Eliminate optimization blockers. - Focus on inner loops. |
|
What are 6 machine independent compiler optimizations? |
- Constant propagation - Constant folding - Common subexpression elimination (CSE) - Dead code elimination (DCE) - Loop invariant code motion (LICM) - Function inlining |
|
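Two of these, written out by hand for contrast (compilers do this automatically; the function names are illustrative):

```c
#include <string.h>

/* Before: strlen(s) is re-run every iteration, and n*n is computed twice. */
int count_before(const char *s, int n) {
    int c = 0;
    for (size_t i = 0; i < strlen(s); i++)
        if (s[i] == 'a' + (n * n) % 26 && (n * n) % 2 == 0)
            c++;
    return c;
}

/* After: same result, with the redundant work pulled out. */
int count_after(const char *s, int n) {
    size_t len = strlen(s);        /* LICM: loop-invariant call hoisted  */
    int nn = n * n;                /* CSE: common subexpression computed */
    char target = 'a' + nn % 26;   /*      once and reused               */
    int even = nn % 2 == 0;
    int c = 0;
    for (size_t i = 0; i < len; i++)
        if (s[i] == target && even)
            c++;
    return c;
}
```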
What are 3 machine dependent compiler optimizations? |
- Instruction scheduling - Loop unrolling - Parallel unrolling |
|
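A sketch of loop unrolling by 4, combined with parallel unrolling (separate accumulators); the trade-off is larger code for fewer branches and a broken serial dependence chain:

```c
/* Sum an array with a 4x-unrolled loop. Four accumulators let the
 * additions proceed independently (parallel unrolling) instead of
 * serializing on a single running sum. */
double sum_unrolled(const double *a, int n) {
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    int i;
    for (i = 0; i + 3 < n; i += 4) {  /* one branch per 4 elements */
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)                /* remainder loop for leftovers */
        s0 += a[i];
    return s0 + s1 + s2 + s3;
}
```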
What are the 3 types of cache misses? |
- Cold (compulsory) miss - Miss due to block not in cache. - Conflict miss - Miss because two blocks map to same cache slot. - Capacity miss - Number of active blocks is larger than cache. |
|
Why do caches work? |
- Temporal locality - Spatial locality |
|
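The classic demonstration of spatial locality is matrix traversal order; both functions below compute the same sum, but the row-major version touches consecutive addresses while the column-major version jumps a full row each step:

```c
#define N 512
static double m[N][N];   /* C arrays are row-major: m[i][j+1] is adjacent */

/* Stride-1 access: each cache line is fully used before moving on. */
double sum_rows(void) {
    double s = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += m[i][j];
    return s;
}

/* Stride-N access: jumps N*8 bytes per step, missing far more often. */
double sum_cols(void) {
    double s = 0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += m[i][j];
    return s;
}
```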
What 5 things are necessary for prefetching to be effective? |
- There is spare memory bandwidth for prefetching - Prefetches are accurate - Prefetches are timely - Prefetched data doesn't replace in-use data - Speed gains must outweigh prefetch cost |
|
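Software can also request prefetches explicitly; a sketch using the GCC/Clang builtin (the lookahead distance of 16 is an illustrative tuning parameter, not a recommendation):

```c
/* Sum an array while prefetching a fixed distance ahead, so each element
 * is (ideally) already in cache by the time the loop reaches it. */
double sum_prefetch(const double *a, int n) {
    double s = 0;
    for (int i = 0; i < n; i++) {
        if (i + 16 < n)
            /* args: address, 0 = read access, 1 = low temporal locality */
            __builtin_prefetch(&a[i + 16], 0, 1);
        s += a[i];
    }
    return s;
}
```

Hardware prefetchers usually handle a sequential pattern like this on their own; explicit prefetching pays off mainly for irregular but predictable access patterns.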
Are conflict misses a concern nowadays? |
No, because L1 & L2 caches have high associativity. |
|
What is your main concern when working with caches? |
Keep the working set within on-chip cache capacity (focus on L1/L2 as necessary). |
|
What is your main concern when working with virtual memory? |
- Keep working set within main memory. |
|
What is your main concern when working with prefetching? |
- Try to arrange your data structures and access patterns to favour sequential/strided access. |
|
Why worry about alignment? |
Hardware is partitioned into blocks (cache lines, pages, etc.). If we don't align our data, a single request may span multiple blocks, wasting time. |