50 Cards in this Set

  • Front
  • Back

What drives CPU speed?

- A faster clock (shorter cycle time, usually through increased transistor density)
- Fewer cycles per instruction (lower CPI)

What is a control hazard?

When the processor must stall (can't fetch next instruction) while waiting for the result of a branch instruction.

What is the CPI due to pipelining?

Ideally, 1.

Why is branch misprediction a problem? (4)

- Branch instructions are 15-25% of all instructions.
- Object-oriented programming has more indirect branches, which are harder for the compiler to predict.
- Processors with deeper pipelines suffer larger penalties.
- Superscalar processors must flush and refetch more instructions.

In general, how does branch prediction work?

Uses history to predict the future. More specifically, a global predictor uses a global history register of recent branch outcomes to index into a table of prediction counters.

What is the CPI due to branch prediction?

Ideally, 1 (down from >1 in non-ideal case of pipelining).

What is a superscalar architecture?

A superscalar architecture executes multiple instructions at the same time on a single processor core (instruction-level parallelism) by using the multiple functional units on the core.

What is the CPI due to a superscalar architecture?

<1

What is out-of-order execution?

Re-ordering instructions that have no control/data dependences between them to reduce stalls from long-latency instructions.

What is the CPI due to out-of-order execution?

<<1

What are some challenges of out-of-order execution? (2)

- Detecting dependencies is hard.
- You might introduce new hazards.

What is deep pipelining?

Simply a deeper pipeline (more stages). Since each stage does less work, its circuitry is simpler and the clock speed can be increased, decreasing the cycle time (per stage).

What is Simultaneous Multithreading (SMT) (a.k.a. Hyperthreading)?

SMT is when a single core executes multiple instructions from multiple threads at the same time.

What is the CPI due to SMT?

<<<1

What are 3 advantages and 1 disadvantage of 64-bit vs. 32-bit architectures?

Advantages:
- Larger address space for large databases, etc.
- Register overflow is nearly impossible.
- Better performance for high-precision arithmetic.
Disadvantage:
- Code size increases (instruction size increases).

Why measure a program? (3)
Why measure a computer? (1)

Program:
- To optimize a program, find the bottleneck.
- To compare implementations.
- To find a bug (why so slow?).

Computer:
- To compare which is better/faster.

What are 3 bad ways of comparing processors and why?

- Clock frequency: IPC for each could be very different.
- CPI or IPC: Depends on the instruction set used and the compiler.
- FLOPS: Depends on the instruction set used and only matters if you use floating point heavily.

How do you measure a processor?

Use wall-clock time:
time = IC * CPI * ClockPeriod

What is Amdahl's Law?

The expected improvement after optimizing only a portion of a program. Simply put,
Speed-Up = Old-Time / New-Time = 1/(1 - f + f/s)
f = fraction of original execution time spent in the portion being optimized
s = speed-up of that portion due to the optimization

How is Amdahl's Law useful?

It allows you to optimize the common case. However, note that the common case may change after one round of optimization.

What are 3 tools for profiling?

- Software timers (C library/OS-level timers)
- Hardware timers and performance counters
- Instrumentation (gprof/gcov)

What information does /usr/bin/time give?

The wall-clock time (real), the time spent in user mode (user), and the time spent in kernel mode (sys).

Note that the user time is the sum of time taken on all cores (so it could be larger than wall-clock).

What information does the x86 rdtsc instruction give? How can Linux programs use it?

rdtsc (read time stamp counter) returns the current value of the core's cycle count.
It can be accessed via the get_tsc() function.

Note, however, that successive calls to get_tsc() may read different cores' counters entirely, because the program may start on one core and be scheduled onto another later.

Give 2 examples of software timers.

- /usr/bin/time
- C library time (sys/times.h)

What are some disadvantages of hardware performance counters?

- Require OS support.
- Can overflow.
- Must be sampled carefully.

What is one (new) tool to easily access hardware performance counters?

perf

What are some disadvantages of instrumentation?

The observer effect - You can't measure the system without disturbing it. Instrumentation code can slow down execution.

How does gprof work?

- Adds code to all function calls to track the caller and the number of times called; uses this to build a call graph with call counts.
- Installs a timer interrupt that fires roughly every 10 ms; on each interrupt, it samples the current user frame to determine the running function, and uses these samples to approximate how long each function ran.

What are some limitations of gprof?

- Periodic interrupts make the timing measurements relatively inaccurate by nature.
- gprof only measures time the program is running, so blocking calls where the program waits for I/O won't be measured.
- It uses the user frame to determine the currently running function and its parent; because of this, it can't work with 3rd-party binaries (note that "-pg" will use "-pg" C libraries).

How does gcov work?

- Adds code to every basic block to count the number of times each basic block is entered (and thus each line is executed).
- Can be used to determine important loops.
- Can be used to determine if code is executed at all.

What are some limitations of gcov?

- It provides no timing information.
- For complex conditions, it provides no information on which sub-expression evaluated to true.

How does valgrind work?

- It is a virtual machine that does just-in-time (JIT) compilation.
- Adds instrumentation dynamically.
- Uses this instrumentation to determine memory issues like leaks or overruns and where in the code they occurred.

What are some disadvantages of valgrind?

- Emulation makes the profiled program 4-5x slower.
- (Kirk-special) Doesn't detect stack overruns.

What are the goals of a compiler? (5)

- Correctness
- Support for debugging
- Compiling for performance
- Fast compilation
- Small code size

What is a basic block?

A group of consecutive instructions with a single entry point and a single exit point.

What are the 3 requirements of performance optimization?

- Preserve correctness.
- Improve average performance.
- Be "worth the effort."

What is a good formula for execution time and what does it imply about the focus of optimizations?

Execution time = #instructions * CPI * time/cycle

Thus, focus on
- Decreasing CPI (avoid hazards, improve cache/memory behaviour)
- Decreasing #instructions (use special/new instructions)

What is the role of an optimizing compiler?

To provide an efficient mapping of the program to the machine.
- Eliminate minor inefficiencies
- Code selection and ordering
- Register allocation

What are 3 limitations of optimizing compilers?

- They are constrained in an effort to maintain program behaviour.
- Most analysis is only within procedures (inter-procedural analysis is too expensive).
- Most analysis is based on static information (run-time inputs are difficult for the compiler to predict).

When in doubt, the compiler must be conservative.

What is the role of a programmer with an optimizing compiler?

- Don't smash code into oblivion.
- Select the best algorithm.
- Write readable, maintainable code.
- Eliminate optimization blockers.
- Focus on inner loops.

What are 6 machine independent compiler optimizations?

- Constant propagation
- Constant folding
- Common subexpression elimination (CSE)
- Dead code elimination (DCE)
- Loop-invariant code motion (LICM)
- Function inlining

What are 3 machine dependent compiler optimizations?

- Instruction scheduling
- Loop unrolling
- Parallel unrolling

What are the 3 types of cache misses?

- Cold (compulsory) miss - Miss because the block is not yet in the cache.
- Conflict miss - Miss because two blocks map to the same cache slot.
- Capacity miss - The number of active blocks is larger than the cache.

Why do caches work?

- Temporal locality
- Spatial locality

What 5 things are necessary for prefetching to be effective?

- There is spare memory bandwidth for prefetching.
- Prefetches are accurate.
- Prefetches are timely.
- Prefetched data doesn't replace in-use data.
- Speed gains must outweigh the prefetch cost.

Are conflict misses a concern nowadays?

No, because L1 & L2 caches have high associativity.

What is your main concern when working with caches?

Keep the working set within on-chip cache capacity (focus on L1/L2 as necessary).

What is your main concern when working with virtual memory?

- Keep working set within main memory.
- May also want to keep # working set pages < # TLB entries

What is your main concern when working with prefetching?

- Try to arrange your data structures and access patterns to favour sequential/strided access.
- If that's not possible (or you're ambitious), try compiler- or manually-inserted prefetch instructions.

Why worry about alignment?

Hardware is partitioned into blocks. If data isn't aligned, a request may span multiple blocks, wasting time.