213 Cards in this Set

  • Front
  • Back
Dies per wafer =
π × (Wafer diameter / 2)^2 / Die area − π × Wafer diameter / (2 × Die area)^(1/2)
(The second term approximates the partial dies lost along the wafer's edge.)
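As a sanity check, the formula evaluates directly (the wafer and die sizes below are hypothetical):

```python
import math

def dies_per_wafer(wafer_diameter_cm: float, die_area_cm2: float) -> float:
    # Gross dies on the wafer minus an approximation of the partial
    # dies lost along the wafer's circular edge.
    gross = math.pi * (wafer_diameter_cm / 2) ** 2 / die_area_cm2
    edge_loss = math.pi * wafer_diameter_cm / math.sqrt(2 * die_area_cm2)
    return gross - edge_loss

# Hypothetical 30 cm wafer with 1.5 cm^2 dies:
print(int(dies_per_wafer(30, 1.5)))  # → 416
```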
What are the parts of an ISA?
set of instructions (arg fields, assembly syntax, machine encoding), named storage locations (reg&mem), addressing modes (naming locations), types and sizes of operands, control flow instructions, memory-mapped i/o interface
Stack code for

D = A – (B+C)
Push B
Push C
Add
Push A
Sub
Pop D
Accumulator code for

D = A – (B+C)
Load B
Add C
Store X
Load A
Sub X
Store D
Reg-Reg code for

D = A – (B+C)
Load R1, B
Load R2, C
Add R3, R1, R2
Load R4, A
Sub R5, R4, R3
Store R5, D
Reg-Mem code for

D = A – (B+C)
Load R1, B
Add R2, R1, C
Sub R3, A, R2
Store R3, D
Big endian

little endian
Big endian: most significant byte is stored at the lowest address.

Little endian: least significant byte is stored at the lowest address; byte-by-byte memory dumps of strings appear backwards.
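The byte-order difference is easy to see with Python's struct module (the value is arbitrary):

```python
import struct

value = 0x01020304
big = struct.pack(">I", value)     # most significant byte at the lowest address
little = struct.pack("<I", value)  # least significant byte at the lowest address
print(big.hex())     # → 01020304
print(little.hex())  # → 04030201
```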
Calling Conventions:

Caller saves

Callee Saves

MIPS
before the call, the caller saves registers that will be needed later, even if the callee might not use them

inside the call, called procedure saves regs it will overwrite, more efficient if many small procedures

some registers caller-saves, some callee saves for optimal performance
Local vs global optimizations
local is within a basic block, global is across branches
Pipelining
multiple instructions are overlapped. Takes advantage of parallelism. (ILP)
5 Basic Pipestages
fetch (get inst from mem), decode (figure out what to do), read operands, execute, write back
What is pipeline Efficiency
Speedup / num stages
Throughput
Efficiency/Clock period
Structural Hazard
Not enough functional units
Data Hazard
Results of earlier instructions not yet available
Control hazards
decisions from branches are not yet available so we don't know which instruction to execute
Precise exception handling
Exceptions are maintained in correct program order i.e., when an exception occurs all instructions before the exception-causing instruction will be allowed to complete and all the ones behind (including this one) will be killed from the pipeline.

10x slower, easier for integer than FP, useful for debug
Imprecise exception handling
When exceptions are not handled in program order e.g., out-of-order exceptions.

Only correct on most common cases, statistical guarantee of correctness through testing
What 4 properties of the ISA DESIGN can create problems? State the problems with each property.
1) Variable instruction length & runtimes: introduces delays, complicates hazard-detection & precise exceptions

2) Sophisticated addressing modes: post-auto increment complicates hazard-detection, restarting, introduces WAR/WAW hazards

3) Multiple-indirect modes complicate pipeline control and timing

4) Self-modifying code: could overwrite an instruction in the pipe
5 Steps of scoreboarding
1) Fetch instruction from cache
2) Issue inst to exec (when no struct or WAW hazards),
3) Read ops (when no RAW hazards)
4) Execute, notify scoreboard on completion
5) Write (when no WAR hazards)
4 Limiting factors of scoreboarding
1) available parallelism among instrs (crossing basic block can help),
2) number of scoreboard inst table entries,
3) number and types of FUs,
4) presence of name dependences (WAR/WAW hazards)
Pipeline CPI =
ideal CPI + hazard stalls
Forwarding helps with
data hazard stalls
Delayed branches & branch scheduling helps with
control hazard stalls
Dynamic scoreboarding helps with
RAW stalls
Register renaming helps with
WAR & WAW stalls
Branch Prediction helps with
control hazard stalls
Issuing multiple instructions per cycle decreases
ideal CPI
Hardware speculation helps with
data and control hazard stalls
Dynamic memory disambiguation helps with
data hazard mem stalls
Loop unrolling helps with
control hazard stalls
Compiler scheduling helps with
data hazard stalls
Compiler dependence analysis decreases what two things?
ideal CPI

data hazard stalls
Software pipelining & trace scheduling decrease what two things?
ideal CPI, data hazard stalls
Hardware support for compiler speculation decreases what two things?
ideal CPI, data hazard stalls
3 Unrolling considerations
1) decreased returns with each unroll,
2) growth in code size,
3) register pressure
Branch-prediction buffer (branch history table (BHT))
low-order n bits of branch address used to index a table of branch history data. May have collisions between distant branches.
Problems with BHT in RISC. What can fix it?
In fetch, we don’t know whether the instruction is a branch, and we don’t know its target yet; by the time we know this (in ID), we already know whether it’s taken, so there is no time savings. Branch-target buffers (BTBs) fix this.
How can we reduce misprediction frequency?
increase buffer size, use different prediction scheme (correlated)
How to implement a correlated predictor?
An (m,n) correlated predictor keeps the outcomes of the last m branches in an m-bit shift register, which selects among 2^m tables of n-bit predictors indexed by the branch address.
How can we get perfect prediction?

What are the disadvantages?
Take both paths (parallel speculative execution).

Large penalty in area, energy, and clock speed.
5 Advantages of Tomasulo
1) Hardware can handle memory dependences the compiler can’t see, e.g. SW R6, 100(R1); LW R7, 36(R2) may or may not conflict;

2)gets rid of WAW and WAR through register renaming. (Doesn’t get rid of RAW)

3) distributed: hazard det. & inst issue is done per exec unit (scoreboarding goes through central unit)

4) Data results go straight to where they are needed.

5)Loads/stores have their own exec units.
4 Types of Tomasulo Unit Components

What do they each do?
1) Reservation stations (RSs dynamic renaming registers),

2) issue logic (redirects instrs outputs to reservation station slots, results direct to RSs),

3) Distributed hazard detection (handled separately by each FU),

4) load & store buffers (queue memory access requests).
3 Key concepts of tomasulo
1) dynamic scheduling,
2) register renaming,
3) dynamic memory disambiguation
3 Steps in tomasulo
1)Issue: get inst from queue. If RS slot open, send inst, else stall. Send operands to RS if available, else note the names in the RS. Rename the registers.
2)Execute: while operands are not available, monitor CDB. When operands are in RS, start executing.
3)Write results: when result avail & CDB free, write to CDB, then registers, & RS/store slots.
3 Tomasulo Drawbacks
1) complex (lots of hardware), less important as transistors/die increases (imp for low-power processors and chip multiprocessors – CMPs)

2) difficult to perform associative access to many RS entries at high speed

3) CDB can be a limiting factor (multi CDBs possible, but adds overhead in RS write ports)
4 Cases when Tomasulo is Most Useful
1) One needs to run binaries for earlier pipeline implementations

2) Code is difficult to schedule statically – many dependences through memory

3) Not enough programmer-visible regs to do static reg renaming

4) There are many FUs available, and WAR and WAW makes scoreboarding bad
Speculation
Improves ILP by overcoming control dependence: fetch, issue, and execute as if predictions were always correct, under a data-flow execution model with the ability to undo effects. Registers and memory are updated only at commit. Dynamic scheduling alone only fetches and issues this way; speculation also executes. Deals with scheduling across different combinations of basic blocks.
What is ROB?

What is it used for?

What info does it store?
Reorder Buffer

Passes results from instructions that are currently speculated, extends RSs.

Stores: inst type, dest, value, ready. FIFO as issued. Allows undo. Handles exceptions on commit.
4 Steps in Speculative Tomasulo Algorithm:
1. Issue (dispatch): if RS and ROB slot free, issue inst and send ops & ROB no for dest
2. Exec: when both operands ready, exec. If not ready, watch CDB. Checks RAW.
3. Write result: write to CDB to all waiting FUs & ROB; mark RS available
4. Commit: when instr at head of ROB & result present, update register with result (or store to mem) and remove instr from ROB. Mispredicted branches flush ROB.
What does speculation do to memory hazards?
1) avoids WAW and WAR hazards b/c updating occurs in-order

2) RAW hazards are maintained by not allowing load to initiate second step of exec if any active ROB store entry has dest that matches the value of the address field of the load. Maintaining prog order for comp of effective address with respect to all earlier stores
Value prediction
predicts val of load that changes infrequently, only good if value does not change often.
Fine grain thread switching

Advantage:

Disadvantage:
alternate thread per inst. Round-robin, skipping stalled threads.

Can hide stalls.

Slows down exec of individual threads.
Coarse grain thread switching

Advantage:

Disadvantages(2):
alternate when a thread is stalled (L2 cache miss). Advantages, doesn’t need very fast switching.

Doesn’t slow indiv thread.

Disadvantages: losses of shorter stalls, when stall occurs must empty pipe, new thread must fill pipe.
SMT: simultaneous multithreading

Requires:
Large set of virtual regs hold reg sets of threads. Reg renaming gives unique identifiers for multiple threads. Out-of-order completion allows threads to utilize max HW.

1) Large reg file needed.

2) Keeping separate PC and ROB for each thread.

3) Uses fine grained, but can use a preferred thread approach.
What is SISD?

What kind of processors use this?
single instruction stream single data stream – uniprocessors
What is SIMD?

What is its purpose?

What kind of processors use this?
Single instruction stream, multiple data streams – exploits data-level parallelism by applying the same operation to multiple pieces of data at once. A single control processor, with a single instruction memory, dispatches instructions to the data processors. Particularly good for graphics.
What is MISD?
Multiple Instruction Single Data

No commercial ones to date
What is MIMD?

What kind of processors use this?
Multiple Instruction Multiple Data

thread level parallelism. Each processor fetches its own instructions and operates on its own data. Flexible and generally applicable.
What is a Commodity cluster?

What is it used for?
Uses third-party processors and networking.

Web-servers and other applications that require a lot of TLP on separate processes.
What is a Custom cluster?

What is it used for?
Specialized programmer created node designs and/or networking code.

For scientific applications and other problems requiring a lot of power for a single problem.
What is SMP?

When does it become less efficient?

What kind of memory access does it use?
symmetric shared memory multiprocessor: share a single centralized memory, large caches, possibly with several banks. Uses multiple point to point connections or a switch.

Becomes less efficient as the number of processors increases.

Employs UMA (uniform memory access), all processors have the same memory latency
What is a distributed memory multiprocessor?

What kind of memory access does it use?
Multiprocessor with individual nodes containing processor, memory, I/O, and an interconnection interface. Nodes could contain multiple processors. Much more scalable since most memory accesses are local. Reduces the latency for access to the local memory. Data communication between processors can become complex. Uses DSM and NUMA.
What is DSM?
distributed shared memory. Any mem ref can be made by any processor to any location assuming they have the proper access rights. They have a shared address space.
What is NUMA?
Non Uniform Memory Access. Access times depend on the location of a word in memory.
What are the hurdles of parallel programming?
limited parallelism and high cost of communications
Directory based CCP
the sharing status of a block is kept in the shared directory.
Coherence defines
behavior of reads and writes to the same memory location
Consistency defines
behavior of reads and writes with respect to accesses to other memory locations
Cache coherence protocols (CCP)
hardware implementation that allows migration and replication
What is Snooping?

What does it require?
Every cache that has a copy of the data also has a copy of the sharing status, no centralized state is kept.

Requires a broadcast mechanism via bus or switch. Cache controllers snoop the bus to see if they have a copy of the requested block. Uses a pre-existing physical connection (bus). Broadcasting makes it simple, but limits scalability.
What is write invalidate protocol?

What does it require to serialize writes?
Ensures exclusive access before writing to a location. Most common protocol.

Requires serialized invalidate access to the bus to serialize writes. If a processor has a dirty copy of the block on snooping invalidate on the bus, the value is supplied in response and causes mem access to be aborted. Existing cache tags and valid bit are used for snooping, add shared state bit.

Absence of centralized structure is both main advantage and prevents scalability.
What is write update (broadcast) protocol?

What is its disadvantage?
Sends writes to all cache lines containing the block. Requires lots of bandwidth.
What is a true sharing miss?
It arises from the communication of data through the cache coherence mechanism. The word being read is invalid.
What is a false sharing miss?
Caused by the invalidation-based coherence mechanism's single valid bit per cache block: another word in the block was written, not the word being read.
What is directory based cache coherency protocol?
Single location for each block’s information. When shared, one directory has a vector for each word indicating which other processors have a copy of the word.
Cache Coherency Protocol States:
Shared
Uncached
Modified
One or more processors have the block cached and the val in mem is up to date

No processor has a copy of the block cache

One processor has a copy of the cache block, and it has written the block, so the value in memory is out of date. The processor is called the owner.
Nodes in directory based cache coherency protocol
Local
Home
Remote
Node where a request originates

Node where the memory location and the directory entry of an address reside. Address space is statically distributed, so the directory for a particular address is always the same.

Node that has a copy of the block in cache.
What are Load linked/store conditional?

How are they used?
Assembly commands

Used together, if the value is changed after LL but before SC, the SC fails. Used for locking. Can insert reg-reg instructions between and check if they were done atomically.
Assembly code for a LL/SC spin lock:
lockit: LL R2, 0(R1)
BNEZ R2, lockit ;not avail, spin
DADDUI R2,R0,#1 ;locked value
SC R2,0(R1) ;store
BEQZ R2,lockit ;branch if fails
What is a Data race free program?
A fully synchronized program
What does sequential consistency require?
Requires that the result of any execution be the same as if the memory accesses within a processor act as if they are executed in order and the accesses among different processors were arbitrarily interleaved. This could be done by requiring that the processor delay all memory accesses until the completion of any invalidations caused by that access. We could also delay the next memory access until the previous one has completed.
Total store ordering (Processor consistency) relaxes what?
Relaxes W->R consistency but maintains write consistency.
Weak ordering (release consistency) relaxes what?
Relaxes R->W and R->R consistency.
What is Strict ordering?
A read to a memory location returns the most recent write
What is relative speedup?
Comparison of the same program
What is true speedup?
Comparisons of the best available version of the program for each platform
What is atomic exchange?
Interchanging a value in register for a value in memory
How does test and set work?
Tests the value and sets it if the value passes a test for atomic operation
How does fetch and increment work?
Returns a value of a mem location and increments it.
What is cache?

What principle supports the use of cache?
Originally meant the highest or first level of memory hierarchy once the address leaves the processor. Now applied whenever buffering is employed to reuse commonly occurring items. May be SRAM or fast DRAM.

Works on the Principle of Locality (temporal and spatial).
What is a cache block?
A fixed size collection of data in the cache
What is the principle of temporal locality?
Words used now are likely to be needed again in the near future.
What is the principle of spatial locality?
When a memory location is accessed, it is likely that other nearby memory locations will be accessed soon.
Formula for Cache size
block size * num sets * set associativity
Block size =
2^num offset bits
Num sets =
2^num index bits
#tag bits
#mem address bits – num index bits – num offset bits
LRU Bits =
set size! orderings; e.g. 4-way set associative has 4! = 24 orderings, requiring ceil(log2 24) = 5 bits per set
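The four geometry formulas above combine into one helper; the 64 KB, 4-way, 64-byte-block, 32-bit-address configuration below is a made-up example:

```python
import math

def cache_geometry(cache_bytes, block_bytes, assoc, addr_bits):
    num_sets = cache_bytes // (block_bytes * assoc)     # cache size / (block * assoc)
    offset_bits = int(math.log2(block_bytes))           # block size = 2^offset bits
    index_bits = int(math.log2(num_sets))               # num sets = 2^index bits
    tag_bits = addr_bits - index_bits - offset_bits
    lru_bits = math.ceil(math.log2(math.factorial(assoc)))  # per-set LRU state
    return num_sets, offset_bits, index_bits, tag_bits, lru_bits

print(cache_geometry(64 * 1024, 64, 4, 32))  # → (256, 6, 8, 18, 5)
```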
What is cache Latency?
time to retrieve the first word of the block
What is cache bandwidth?
Determines time to retrieve a block after the first word is found
What does in-order-execution do?
blocks/stalls all instructions until data is available
How does out-of-order-execution work?
instruction using the result of a cache miss must wait, but other instructions can continue
What is virtual memory?
cache objects residing on disk, broken into fixed sized blocks called pages.
What is a page fault?
When a processor references an item in a page that is not in cache or main memory. The entire page is moved from disk to main memory.
CPU execution time =
(CPU clock cycles + Memory stall cycles)*Clock cycle time
Memory stalls =
IC * Mem accesses per instruction * miss rate * miss penalty
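Both formulas above can be checked with a short calculation (all the workload numbers below are hypothetical):

```python
def cpu_time(ic, base_cpi, accesses_per_inst, miss_rate, miss_penalty, clock_s):
    # CPU execution time = (CPU clock cycles + memory stall cycles) * clock cycle time
    cpu_cycles = ic * base_cpi
    stall_cycles = ic * accesses_per_inst * miss_rate * miss_penalty
    return (cpu_cycles + stall_cycles) * clock_s

# 1M instructions, CPI 1.0, 1.5 accesses/inst, 2% miss rate,
# 100-cycle miss penalty, 1 GHz clock: 1M + 3M = 4M cycles total.
t = cpu_time(1_000_000, 1.0, 1.5, 0.02, 100, 1e-9)
print(round(t * 1e3, 2), "ms")  # → 4.0 ms
```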
Where is a block placed in direct mapped cache?
block has only one place it can be in the cache, block address mod num blocks in cache
Where is a block placed in fully associative cache?
block can be anywhere in the cache
Where is a block placed in set associative cache?
block can only be in a restrictive set of locations. Block is mapped onto a set, then the block can be anywhere in that set.
What is Direct Mapped cache?
1-way set associative. All blocks have a specific location using block address MOD number frames (sets) in cache. No LRU replacement can be used, every block just maps to a specific frame. Faster to find a block (faster clock cycle), but more likely to have a miss because of temporal locality.
What is fully associative cache?

What is its advantage?

What is its disadvantage?
There is only one set spanning all frames. To find a block we must search the whole cache for a matching tag, but a block can be placed in any empty frame or the LRU frame.

Less likely to have a miss

Slower to find a block (slower clock cycle)
What is a cache set?
a group of blocks in the cache, usually chosen by bit selection.
What is the formula for bit selection?
(Block address) MOD (Number of sets in cache)
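With a power-of-two number of sets, the MOD is just the low-order bits of the block address (a minimal sketch):

```python
def set_index(block_address: int, num_sets: int) -> int:
    # Bit selection: (block address) MOD (number of sets in cache).
    return block_address % num_sets

# With 16 sets, the index is simply the low-order 4 bits:
assert set_index(0b1101_0110, 16) == 0b0110
print(set_index(90, 16))  # → 10
```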
What is the Block offset?
Used to select the desired data from the block. Should not be used in the comparison, since the entire block is either present or not.
What is the tag field used for?
compared against for a hit
What is the index field used for?
Selects the set. Not used in comparison since it would be redundant since the index was used to select the set in the first place.
What is LRU block replacement?
Record when each block is used, replace the one that was used longest ago. Can be expensive.
What is FIFO block replacement?
Just replace the oldest block. Cheaper than LRU.
What is a write through cache?
Data is written to both the block in the cache and the lower level memory at the same time. Easier to implement than write back, cache is always clean. Simplifies data coherency.
What is a write back cache?
Data written to cache only, it is written to main memory when it is replaced. Writes occur at the speed of cache memory. Multiple writes require only one lower level write. Uses less mem bandwidth, good for multiprocessors. Also uses less power.
What does the dirty bit indicate on a block?
Bit indicating whether the block has been modified while in the cache. If it is clean, there’s no reason to write it back on a miss, since it already exists in the lower memory.
What is a write stall?
processor waiting for writes to complete during write through
What is a write buffer?
Reduces write stall delays, allows overlapping of processor execution with memory updating
What is write allocate?
Block is allocated on a write miss. Write misses act like read misses.
What is no-write allocate
Write misses do not affect the cache. Block is modified only in lower level memory. Blocks stay out of cache until the processor tries to read the block.
What is the advantage of separating instruction and data caches?

What is the disadvantage?
Adv: can optimize for each, double the bandwidth.

Disadv: more complex design, can’t dynamically adjust space
What is a compulsory miss?
A miss that will occur even if you had an unlimited cache. Such as the very first access to a block.
What is a capacity miss?
If all the blocks needed during execution of a program cannot be kept in cache, there will be capacity misses. This will result in blocks being discarded and later retrieved.
What is a conflict miss?
If the cache is not fully associative, blocks will be discarded because another block will be put into the same set.
What 4 cache optimizations reduce the miss rate?
Larger block size (compulsory)
Bigger cache (capacity)
Higher associativity (conflict)
Compiler optimizations
What 4 cache optimizations reduce miss penalty?
Multilevel caches
Critical word first
Merging write buffers
Giving priority to read misses over writes
What 4 cache optimizations reduce hit time?
Small and simple caches

Way prediction

Trace caches

Avoid address translation during indexing of cache
What 3 cache optimizations increase cache bandwidth?
Pipelined, multibanked, and nonblocking caches
What can hardware prefetching and compiler prefetching do for caches?
Reduce miss penalty or miss rate via parallelism
What is way prediction in caches?
use prediction bits in the cache to predict the next block to be accessed (85% accuracy)
What is hit under miss in caches?
continue to supply cache hits during a miss.
What is sequential interleaving in caches?
A method for deciding where to put blocks in multibanked caches. Divide the cache into banks, locations being address modulo number banks. That way if we have a request for sequential memory blocks, 90..94, we can serve up as many blocks as we have banks at once.
What is critical word first in caches?
Request the missed word first from memory and send it to the processor as soon as it arrives. Processor continues execution while filling the rest of the words in the block.
What is early restart in caches?
Fetch words in normal order, but as soon as the requested word arrives, send it to the processor so the proc can continue.
What is write merging in caches?
If a write buffer already contains data for an address that is being written, combine that data with the entry.
Describe the following compiler cache optimizations:
Loop interchange
Loop fusion
Blocking
Make the loop access words sequentially, not skip around

Take advantage of row major or column major order

Divide the accesses into blocks of size B
What is SRAM?

How many transistors per bit?

What happens in standby mode?

What does it emphasize?
static ram; don’t need to refresh so the access time is close to the cycle time.


Six transistors per bit to prevent info from being disturbed when read.

Needs only minimal power to maintain charge in standby mode.

Emphasizes speed. 8-16 times faster than DRAM and 8-16 times as expensive.
What is DRAM?

How many transistors per bit?

What does it emphasize?

How do you refresh the bits?

How much of the total time is used refreshing the bits?
dynamic ram; requires data be written back after being read. Requires cycle time be greater than access time so that address lines are stable between accesses. Also requires a refresh.

Use a single transistor to store a bit. Reading the bit destroys the info, so it must be restored. To prevent loss when not being read, the bit must be periodically refreshed by reading the row.

Emphasizes cost per bit and capacity. 4-8 times the capacity of SRAM. Organized as a rectangular matrix, strobe 1 is RAS, strobe 2 is CAS.

Reading the row refreshes all bits in that row. Number of steps in a refresh is the square root of the capacity (rows). Should be less than 5% of total time.
What is a DIMM?
dual inline memory module; contains 4-16 DRAMs organized to be 8 bytes wide in desktops.
What is fast page mode?
Repeatedly accessing the row buffer without another row access time.
What does Synchronous DRAM add?
Added clock signal to interface so that repeated transfers would not bear synchronization overhead.
What is DDR?
Double data rate. Transfers data on the rising and falling edge of the clock signal. Doubles peak data rate. Activates multiple banks internally.
PC2100 =
DDR266 =
133 MHz × 2 × 8 bytes = 2128 MB/sec, marketed as 2100

133 MHz DDR chip (transferring on both the rising and falling edge)
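The naming arithmetic is easy to verify (the DDR266/PC2100 figures come from the card above):

```python
clock_mhz = 133           # DDR266 base clock
transfers_per_cycle = 2   # DDR: data on both rising and falling clock edges
bus_width_bytes = 8       # 64-bit DIMM data path
peak_mb_s = clock_mhz * transfers_per_cycle * bus_width_bytes
print(peak_mb_s)  # → 2128, marketed as PC2100
```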
What is a Virtual Machine?

Is it safer than a full OS? Why?
An efficient, isolated duplicate of the real machine in complete control of system resources.

Safer because it is a smaller code base so that there are less bugs and higher security.
What is protection via virtual memory?

Is it safe?

What 4 things does it require from the OS?
Protects processes from each other.

Not safe enough because OS may have bugs.

1: two modes – user and OS processes (kernel or supervisor).
2: provide processor state that are read only for user processes.
3: provide mechanisms for processor to change from user to kernel access.
4: provide mechanisms for limiting mem access without swapping the process to disk on context switch. Usually done by adding protection to each page of VMem.
What is the process space?
a programs living space. The program itself plus any state needed.
What is a Translation lookaside buffer (TLB)?
Allows avoiding address translation during indexing of cache to reduce hit time. Translates virtual address to physical address to access memory.
What is a System virtual machine?

What are the 2 advantages?
VMs running ISAs that match the hardware.

Benefits include managing software (old OSes or Beta OSes), managing hardware (sharing hardware resources).
What is a VM Monitor (VMM)?

What service does it perform?
Software that supports VMs.

It presents a software interface to guest software, isolates the state of guests from each other, and protects itself from guest software. Behave as if it were running on native hardware (except performance).
What is virtualizable hardware?
Hardware that allows VMs to execute directly on hardware.
What is real/machine memory?

What are used to map virtual mem to real mem?

what maps real memory to physical memory?
The intermediate level between virtual mem and physical mem.

The guest OS's page tables

The VMM's page tables
What is a shadow page table?
It's a page table used by VMM to map directly from guest virtual address space to physical address space, skipping the intermediate real memory.
What is Paravirtualization?
Allowing small modifications to the guest OS to simplify virtualization. Eg. A guest OS could assume a real memory as large as its virtual memory so that no mem management is required by the guest OS.
Cache index =
2^(index bits) = cache size / (block size × set associativity); e.g. 512 sets = 2^9 → 9 index bits
What are the three disk metrics?
Bits per inch (BPI), tracks per inch (TPI), and areal density (bits per square inch) = BPI × TPI
What is the Reliability of N Disks?
Reliability of one disk / N (without redundancy, an array of N disks is N times less reliable than one disk)
How much power does a disk take?
Diameter^4.6 × RPM^2.8 × number of platters
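Both disk formulas can be applied numerically (the drive parameters below are hypothetical):

```python
def array_reliability(single_disk_reliability: float, n_disks: int) -> float:
    # Without redundancy, reliability of N disks = reliability of 1 disk / N.
    return single_disk_reliability / n_disks

def relative_disk_power(diameter_in: float, rpm: float, platters: int) -> float:
    # Power scales as Diameter^4.6 * RPM^2.8 * number of platters.
    return diameter_in ** 4.6 * rpm ** 2.8 * platters

print(array_reliability(1.0, 4))  # → 0.25
# Halving RPM (other factors equal) cuts power by 2^2.8, roughly 7x:
ratio = relative_disk_power(3.5, 15000, 4) / relative_disk_power(3.5, 7500, 4)
print(round(ratio, 2))  # → 6.96
```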
What is RAID?

Why is it more dependable?
Redundant array of inexpensive disks.

Can be more dependable because MTTF is in years and MTTR is in hours. Unless more than one disk fails within the MTTR, it should be fine.
What is RAID 0?
no redundancy, data may be striped across the disks
What is RAID 1?
mirroring or shadowing, two copies of every piece of data. May optimize read by reading parts from each disk, but may take longer for writes because both disks must have all data at all times.
What is RAID 2?
memory-style error correcting code in disks.
What is RAID 3?
High level interfaces figure out which disk failed. When a failure occurs, you “subtract” the good data from the good blocks, and what remains is the missing data (parity). Data is spread across all disks. Single parity disk.
What is RAID 4?
Allow each disk to perform independent small reads. Small writes are slower than small reads, but reads have low overhead as RAID 3. Single parity disk.
What is RAID 5?
Distributes parity info across all disks in the array, removing the bottleneck of raid 4 needing to read/write the same check disk.
What is RAID 6?
Uses two blocks per stripe of data and row diagonal parity to recover from more than 1 failure at a time. Recover along the diagonal, then the data recovered can be used to recover horizontally.
What is a storage failure?
When actual behavior deviates from specified behavior
What is a fault?
The cause of an error.
What is a latent error?
The error caused when the fault occurs (but not yet encountered)
What is an effective error?
When a latent error is activated
What is error latency?
The time between when an error occurs (latent error) and when it is activated (effective error)
What is a hardware fault?
A device failure (eg hit by an alpha particle)
What is a design fault?
Faults in software (occurs more than in hardware)
What is an operation fault?
Fault occurring from a mistake made by maintenance personnel
What is an environmental fault?
Fire, flood, earthquake, sabotage, etc
What is a transient fault?
Exists for a limited time and does not recur
What is an intermittent fault?
System oscillates between faulty and fault free
What is a permanent fault?
Fault that is not corrected over time
What is the storage response time?
queue + device service time
What does Linux emphasize in storage?

What does Solaris emphasize?

What does Windows emphasize?
performance over data availability (auto 1 hr recovery)

data availability over performance (auto 10 min recovery)

Favors neither (manual 23 min recovery)
What is Little's law?
mean number of tasks = arrival rate * mean response time
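Applying the law is a one-liner (the queue numbers below are hypothetical):

```python
def mean_tasks_in_system(arrival_rate_per_s: float, mean_response_s: float) -> float:
    # Little's law: mean number of tasks = arrival rate * mean response time.
    return arrival_rate_per_s * mean_response_s

# Hypothetical disk queue: 50 I/Os per second, 20 ms mean response time.
print(mean_tasks_in_system(50, 0.020))  # → 1.0
```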
branch folding
For unconditional branches, when it's basically just a jump. Instead of evaluating the branch, just execute the target. This only works if all of the target instructions can be stored in the BTB.

When the branch-target buffer signals a hit and indicates that the branch is unconditional, the pipeline can simply substitute the instruction from the branch-target buffer in place of the instruction that is returned from the cache (which is the unconditional branch).

The point of this is to achieve a 0-cycle branch.
branch target buffer
Buffer that contains the calculated targets of branches for if they are going to be taken. If the prediction is that the branch is taken, use the target (stored) PC to start fetching the next instructions.
Compulsory miss:
First access of data in a block
Capacity miss:
Kicked out because the cache is full
Conflict miss:
Due to set associativity, kick out because something else has taken the spot
Coherency miss:
Miss because of coherency mechanism (invalidated)
False sharing miss:
The block was set invalid because another processor wrote a different word in the shared block, so this word's line was invalidated as well.
True sharing miss:
Coherency miss where another Processor has written the line
TTS lock using exch
try: li R2,#1
lockit: lw R3,0(R1) ;load var
bnez R3,lockit ;not free=>spin
exch R2,0(R1) ;atomic exchange
bnez R2,try ;already locked?
LLSC Atomic Swap
try: mov R3,R4 ; mov exchange value
ll R2,0(R1) ; load linked
sc R3,0(R1) ; store conditional
beqz R3,try ; branch store fails (R3 = 0), no store
mov R4,R2 ; put load value in R4
LLSC Fetch and Increment
try: ll R2,0(R1) ; load linked
addi R2,R2,#1 ; increment (OK if reg–reg)
sc R2,0(R1) ; store conditional
beqz R2,try ; branch store fails (R2 = 0)
Advantages and Disadvantages of the following ISAs
Answer:
2-bit Predictor Table
2-bit Predictor Table
Draw the 2-bit predictor diagram
2-bit predictor diagram
Pieces of FP Tomasulo Diagram:
1) Load buffers
2) FP Operation Queue
3) FP Registers
4) Store Buffers
5) Reservation Stations
6) FP Adders
7) FP Multipliers
8) CDB (Common Data Bus)
Tomasulo Diagram
kernel tests
small, key pieces of real applications; (better because they are real progs)
toy programs
100-line programs from beginning programming assignments, such as quicksort
synthetic benchmarks
fake programs invented to try to match the profile and behavior of real applications, such as Dhrystone
Normalized arithmetic mean
the average of the execution times, divide by a particular one to normalize
What is the Geometric mean formula?

2 reasons it is a good measure?

Why is it a bad measure?
Mean = (sample[1] × sample[2] × … × sample[n])^(1/n)

consistent no matter which machine is the Base

alleviates the problems from outliers

not related to actual execution time; rewards easy enhancements equally (reducing 2 s to 1 s counts the same as reducing 200 s to 100 s)
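The consistency property (same answer no matter which machine is the base) is easy to demonstrate (the ratios below are hypothetical):

```python
def geometric_mean(samples):
    product = 1.0
    for s in samples:
        product *= s
    return product ** (1 / len(samples))

ratios_vs_a = [2.0, 8.0]                    # runtimes normalized to machine A
ratios_vs_b = [1 / r for r in ratios_vs_a]  # same data normalized to machine B
print(geometric_mean(ratios_vs_a))  # → 4.0
print(geometric_mean(ratios_vs_b))  # → 0.25, exactly 1/4.0
```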
Normalized geometric mean
the geometric mean of the programs normalized to a base machine
Speedup
Amdahl's law: 1 / (fraction enhanced / speedup enhanced + (1 − fraction enhanced))

General: original time / new time

Pipeline: (num stages × num instructions) / (num stages + (num instructions − 1))
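All three speedup formulas can be exercised in a few lines (the workload fractions below are hypothetical):

```python
def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    # Amdahl's law, in the form given on the card above.
    return 1 / (fraction_enhanced / speedup_enhanced + (1 - fraction_enhanced))

def pipeline_speedup(num_stages: int, num_instructions: int) -> float:
    return (num_stages * num_instructions) / (num_stages + num_instructions - 1)

print(round(amdahl_speedup(0.5, 10), 3))    # → 1.818; half the work sped up 10x
print(round(pipeline_speedup(5, 1000), 2))  # → 4.98; approaches the 5-stage ideal
```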
CPU time =
Instruction count * CC time * CPI
MIPS
Inst count / (Exec time × 10^6) = Clock rate / (CPI × 10^6)

Not an accurate performance measure: it varies with the instruction mix and cannot compare different instruction sets.
Cost of die
Cost of wafer / (Dies per wafer × Die yield)
Cost of integrated circuit
(Cost of die + Cost of testing die + Cost of packaging and final test) / Final test yield
Die yield =
Wafer yield × (1 + (Defects per unit area × Die area) / a)^(−a)

a ≈ 4
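The yield and cost formulas chain together naturally (the defect density, wafer cost, and die count below are hypothetical):

```python
def die_yield(wafer_yield: float, defects_per_cm2: float, die_area_cm2: float,
              a: float = 4.0) -> float:
    # Die yield = wafer yield * (1 + (defects per unit area * die area) / a)^(-a)
    return wafer_yield * (1 + defects_per_cm2 * die_area_cm2 / a) ** -a

def cost_of_die(wafer_cost: float, dies_per_wafer: int, yield_: float) -> float:
    # Cost of die = cost of wafer / (dies per wafer * die yield)
    return wafer_cost / (dies_per_wafer * yield_)

y = die_yield(1.0, 0.4, 1.5)
print(round(y, 3))                          # → 0.572
print(round(cost_of_die(5000, 400, y), 2))  # → 21.86
```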