What is Computer Systems Architecture?

A system of hardware components that, together with software, meets functional, performance, and cost goals.

What is the Harvard architecture, and what are some of its advantages and disadvantages?

The Harvard architecture has two memories and buses, one for program instructions and one for data.




The program and the data can be accessed in parallel, increasing speed of execution. Each memory can be optimized separately, with different cache structures, word sizes, etc.




An early disadvantage of the Harvard architecture was the cost of providing two memories. The required circuitry is also more complex.

What is the Von Neumann architecture?

Program instructions and data are stored in the same memory.

What are the three "walls" for computer performance (barriers to increasing performance)?

ILP Wall


- Dependencies between instructions limit performance gains




Power Wall


- leaky transistors output too much heat to dissipate




Memory Wall


- processors execute instructions more quickly than they can be fetched from memory


- speeding up individual instructions does not improve performance due to memory bottleneck



What technique can be used to improve performance despite these three "walls"?

Multi-core computing

What is the difference between a homogeneous and a heterogeneous multicore system?

A homogeneous multicore system uses the same core design repeated multiple times, whereas a heterogeneous multicore system combines multiple specialized core designs.

What is an un-decidable or un-solvable problem?

A problem for which no algorithm exists that will work for every case.

What is an intractable problem?

A problem for which a solution exists, but executing the solution is infeasible due to its time or memory requirements.

What is a tractable problem?

The problem has a solution, and execution of the solution is feasible.

What are the Big-O reduction rules?

Sum Rule: Ignore lower order terms in a sum




Scaling Rule: Coefficients may be ignored
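For example, 5n^2 + 3n + 20 reduces to O(n^2): the sum rule drops 3n and 20, and the scaling rule drops the coefficient 5.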

What is Response Time?

The time delay between a "cause" event and the "effect" event

What is throughput?

A measure of related events over a time period.

How can CPU execution time be approximated?

t = (# of instructions) * (average cycles per instruction) * (seconds per cycle)




Seconds per cycle is 1 / (clock frequency)
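A minimal sketch of this calculation in Python, using made-up illustrative numbers (not values from the cards):

    # CPU execution time: t = instructions * CPI * seconds_per_cycle
    instructions = 2_000_000   # illustrative instruction count
    avg_cpi = 1.5              # illustrative average cycles per instruction
    clock_hz = 1e9             # 1 GHz clock, so seconds per cycle = 1 / clock_hz

    t = instructions * avg_cpi * (1 / clock_hz)
    print(f"Execution time: {t * 1e3:.1f} ms")   # 3.0 ms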

How is Speedup calculated?

(Old Time) / (New Time)

How is Percent Speedup calculated?

(Old Time - New Time) / (New Time) * 100%
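Worked example (illustrative numbers): if the old time is 10 s and the new time is 8 s, speedup = 10/8 = 1.25, and percent speedup = (10 - 8)/8 = 25%.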

How can performance be improved by changing the number of clock cycles required for a program, the clock frequency, or the clock period?

- decrease the number of clock cycles required


- increase the frequency


- decrease the cycle period

How can MIPS be calculated?

(Number of instructions, in millions) / (Execution time in seconds)
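Worked example (illustrative numbers): a program of 500 million instructions that runs in 2 seconds achieves 500/2 = 250 MIPS.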

What changes can lead to performance increases?

- increases in clock rate


- improvements that lower CPI for instructions


- improvements in architecture leading to higher throughput


- compiler optimizations that reduce CPI


- choice of algorithm

What is the locality principle?

Programs access a small portion of address space over small time intervals.

What is temporal locality?

A memory item that was just accessed is likely to be accessed again soon.

What is spatial locality?

Memory items close to a referenced memory location are likely to be referenced soon.

What is the most common type of cache memory?

K-way set associative

What is write through?

A write always accesses main memory

What is read-through?

The CPU reads from main memory directly while the cache is being updated on a cache miss.

What fields are stored as management data in a cache?

Valid bit: is the line in use


Dirty bit: has the line been modified


Tag bits: identify which memory block is stored in the line

What is associative mapping? What is a disadvantage of associative mapping?

A block can be stored in any line of the cache. Memory addresses are composed of tag bits and word bits. Tag bits identify the block, word bits identify the word. A disadvantage of a fully associative cache is that the full cache must be searched to determine if there is a cache hit.

What is direct mapping? What is the advantage?

A memory block only maps to a single cache line (multiple blocks map to the same line). Memory addresses contain:


Tag bits: identifies a block


Line bits: identifies the line that a block maps to


Word bits: identifies a word in a block


An advantage is that only one line must be checked to determine whether there is a cache hit

What is K-Way Set Associative mapping?

Cache is organized into sets of k lines each. Every address maps to one set, many addresses map to that set. A block can go into any cache line in the appropriate set. Associative mapping is used within the set




Addresses are composed of:


Word bits: identify the word in a block


Set bits: identify the set of cache lines that can hold the block


Tag bits: identifies a block
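A minimal Python sketch of this address decomposition, assuming an illustrative geometry (16-byte blocks giving 4 word bits, 128 sets giving 7 set bits, 32-bit addresses); these parameters are examples, not values from the course. With k = 1 line per set, the same split reduces to direct mapping:

    # Split a 32-bit address into tag / set / word fields for a
    # k-way set-associative cache (illustrative geometry).
    WORD_BITS = 4   # 16-byte blocks
    SET_BITS = 7    # 128 sets

    def split_address(addr):
        word = addr & ((1 << WORD_BITS) - 1)
        set_index = (addr >> WORD_BITS) & ((1 << SET_BITS) - 1)
        tag = addr >> (WORD_BITS + SET_BITS)
        return tag, set_index, word

    tag, set_index, word = split_address(0x1234ABCD)
    print(hex(tag), set_index, word)   # tag=0x24695, set_index=60, word=13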

Why might we need to access memory multiple times to fill a cache line?

The memory interface may not support reading a full cache line at a time; e.g., the cache line might be 128 bits while memory can only be read 8 bits at a time.

How is effective memory access time calculated?

t_eff = (hit rate) * t_cache + (1 - hit rate) * t_memory
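Worked example (illustrative numbers): with a 95% hit rate, a 1 ns cache, and 100 ns main memory, t_eff = 0.95 * 1 + 0.05 * 100 = 5.95 ns.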

What is the advantage of virtual memory?

Virtual memory allows programs that are bigger than main memory to be executed, through the use of paged memory management.

What is a TLB and what is it used for?

Translation Lookaside Buffer. A cache that maps virtual addresses to physical addresses in main memory.

What is segmentation? Advantages and disadvantages.

Programs are broken into variable-sized segments that follow logical program boundaries. This is good for taking advantage of locality, but leads to fragmentation and a complicated memory allocation procedure.

What is paging? Advantages and disadvantages.

Programs are broken into fixed-size pages, which makes memory allocation easier and reduces external fragmentation. Within a page, memory may be fragmented and space used inefficiently (internal fragmentation).

What is a CISC computer? What are some consequences of a CISC computer?

A CISC architecture has many specialized instructions. Compilers must have a lot of logic to optimize code. Instructions are typically variable length and need a varying number of cycles to execute. You end up with fewer instructions per program, but a high CPI.

What is a RISC computer? What are some consequences of a RISC computer?

A RISC computer has a smaller number of instructions and restricted addressing modes. Instruction size is typically fixed; there are more instructions in a program, but a lower CPI.

Why use RISC over CISC?

Allows designers to focus on optimizing the most frequently used instructions. Simplifies compilers and processor implementation. Leads naturally to pipelining.

What are the steps in a classic 5-stage pipeline?

- Instruction Fetch (IF)


- Instruction Decode (ID)


- Execute (EX)


- Memory Access (MEM)


- Writeback (WB)



How long does it take to execute N instructions in an M-stage pipelined processor?

Time = T + (N - 1)*T/M, where T is the time to execute one instruction and M is the number of pipeline stages.

What is the ideal theoretical speedup of an M-stage pipeline?

Speedup = N*T / (T + (N - 1)*T/M), which approaches M as N -> infinity.
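A minimal Python sketch of both formulas, assuming an illustrative 5-stage pipeline and a made-up 10 ns instruction time:

    # Pipelined execution time and ideal speedup (assumes no stalls).
    T = 10e-9   # time for one instruction, un-pipelined (illustrative)
    M = 5       # number of pipeline stages

    def pipelined_time(n):
        return T + (n - 1) * T / M

    for n in (1, 10, 1000):
        speedup = (n * T) / pipelined_time(n)
        print(f"N={n}: speedup = {speedup:.2f}")   # 1.00, 3.57, 4.98
    # speedup approaches M = 5 as N grows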

What is a pipeline stall?

When a stage can't complete its operation, it must wait, and no earlier stage can move forward.

What is a prefetch queue and how can it improve performance?

If there is a delay in the fetch stage, instructions can continue to be taken out of the prefetch queue and sent through the pipeline. If there is a delay in the pipeline, instructions can be fetched before they are needed, reducing the impact of cache misses later on.

What is a delayed branch?

Some instructions after the branch are always executed, even if the branch is taken. So, you can fill the slots after the delayed branch with useful instructions that have acceptable effects.

What are some issues with delayed branches?

Only unconditional branches are candidates for delayed branching. There may also be no candidate instructions to fill the delay slot



How does branch prediction work? How can it improve performance?

If a conditional branch appears at the end of a loop, it is likely that most times the branch is encountered, it will be taken to repeat the loop. Branch prediction makes use of this phenomenon. Hardware assumes the branch is either taken or not; based on this decision, it changes where instructions are fetched from. If it assumes the branch is taken, instructions are fetched from the branch target; if it assumes not taken, fetching continues sequentially. This leads to a smaller branch penalty when the assumption is correct.

What is the difference between static branch prediction and dynamic branch prediction?

Static branch prediction simply assumes that a particular branch instruction is always either taken, or not. Dynamic branch prediction tracks execution history (branch history table) for a branch, and predicts that future executions of a branch will follow the previously recorded decision.

What is speculative execution? How does it work to improve performance?

Speculative execution is combined with branch prediction. The branch predictor guesses whether a branch will be taken or not, and instructions are fetched based on that guess. Speculative execution is the execution of these instructions, which may turn out not to be needed. To avoid corrupting the state of the processor, speculative instructions write to "shadow registers"; their effects are not committed until the outcome of the branch is known.

What is the impact of pipelining on performance calculations?

The performance of pipelined processors depends very heavily on the dependencies between instructions. Thus, the performance of pipelined processors is very program-dependent. Benchmarks may be a good way to measure performance.

What is a superscalar architecture? How does it improve performance?

A processor with a superscalar architecture contains multiple execution units (ALU, memory controller, etc.). When one execution unit is busy, a different instruction may execute simultaneously in another execution unit. This allows the CPU to execute multiple instructions in the same cycle.

How does instruction fetch and dispatch work in a superscalar architecture?

Multiple instructions are fetched and decoded every cycle. The dispatch stage has a scheduling window, a group of instructions that are held until certain conditions are met for their execution (there are no dependencies on currently executing or waiting instructions, and there is a free execution unit). Instructions may be executed in a different order than they are written (out of order execution).

What is the difference between static and dynamic scheduling for out-of-order execution?

Static scheduling is done at compile time. The compiler needs to know about the specific CPU's execution units to optimize the code.




Dynamic scheduling is done at run time. The CPU analyzes the window of instructions and orders them based on the execution units and dependencies.

What is a common example of static scheduling?

Loop unrolling
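A sketch of the idea in Python (purely illustrative; in practice the compiler applies this transformation to machine code):

    # Loop unrolling: the second version does four independent updates per
    # iteration, reducing loop-control overhead and exposing instruction-level
    # parallelism that a superscalar CPU can exploit.
    def scale(a, c):
        for i in range(len(a)):
            a[i] *= c

    def scale_unrolled(a, c):
        i, n = 0, len(a)
        while i + 4 <= n:          # unrolled by a factor of 4
            a[i] *= c
            a[i + 1] *= c
            a[i + 2] *= c
            a[i + 3] *= c
            i += 4
        while i < n:               # handle leftover elements
            a[i] *= c
            i += 1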

What are the downsides of out-of-order execution?

In practice, the number of instructions per cycle (IPC) averages less than 2, due to dependencies between instructions. OOO execution also makes hardware more complex and costly, and requires more power. Recent processors are moving away from OOO.

What is a VLIW processor?

Very Long Instruction Word processors use a RISC, superscalar architecture. Instructions are very long, and contain multiple shorter instructions encoded into one word, with at most one short instruction per execution unit. The short instructions are statically scheduled by a compiler.


This makes it easy for the processor to issue the short instructions in parallel: the compiler has already grouped independent instructions, so there is no need to check for dependencies at run time.

What is a vector processor and how does it improve performance?

A vector processor is designed to improve the performance of element-by-element computations on vectors or matrices. Computations such as matrix multiplication require many operations with a single ALU. If many ALUs operate in parallel on a vector, the number of sequential operations required can be greatly reduced.

What is a co-processor?

A coprocessor is designed to speed up specialized operations (FPU, vector processor, etc.). Special coprocessor instructions are added to the instruction set; when the CPU encounters one of these instructions, it is sent to the appropriate coprocessor. The host processor must wait while the coprocessor executes the instruction (tight coupling).

What are the important pieces of parallel organization?

- Which parallel hardware components are available?


- How are the parallel components connected, and how do they interact?


- How is application information exchanged between concurrent flows?


- How can the software be designed to account for the parallelism?

What are two strategies for improving performance using concurrency?

- Break a single problem into small pieces; solve these independently and then assemble a solution




- Solve many unrelated problems in parallel

What is the Flynn taxonomy?

A way to classify architectures by the instruction streams applied to data streams.




SISD: Single instruction, single data. A single instruction stream is executed on a single data stream. Single-processor pipelining




SIMD: Single Instruction Multiple Data. A single instruction is executed on different data streams. Vector processor.




MISD: Multiple instruction, single data. Multiple instruction streams are executed on a single data stream. There are no common examples.




MIMD: Multiple Instruction Multiple Data. Different instruction streams are executed on different data streams. Multiprocessing.





What is hyperthreading, and why is it used?

Hyperthreading is the duplication of CPU state, so that 2 logical processors appear to exist on a single chip. The logical processors can run independently but share execution units. Also known as Simultaneous Multi-Threading (SMT).

What is the maximum theoretical speedup for an M-processor computer?

M

WRT parallelism, what is granularity?

The level of detail at which parallelism is expressed. Fine-grained parallelism means the parallelism is expressed at a very low level (individual instructions).


Coarse-grained parallelism means the parallelism is at a higher level (entire tasks divided between processors).

WRT parallelism, what is coupling?

The ease and speed with which processors communicate and access shared resources. In a tightly coupled system, processors can communicate and access shared data very quickly.


In a loosely coupled system, communication and resource access is more complicated and slower.

What is the difference between shared-memory multiprocessors and distributed-memory multiprocessors?

Shared-memory multiprocessors share a common memory system, accessed over a common bus. They are tightly coupled, which leads to cache coherence problems.




Distributed memory multiprocessors each have a physically separate memory system, and communicate through a separate network. Loosely coupled, cache coherence not a problem.

What is cache coherence?

When multiple caches are present, the caches are coherent when they all contain the same value for a given memory location. A cache containing a stale value is incoherent.

What is a write-back policy?

Data is written only to cache, main memory is updated when the cache block is replaced.

What is a write-through policy?

Both cache and memory are updated at the same time on a write.

What is cache snooping?

Each cache monitors the system bus and invalidates or updates itself when it notices activity for memory addresses that it is currently caching.

What are the tradeoffs involved in shared memory vs distributed memory?

Shared memory is easier to program for, and data management is easier. Distributed memory is easier to scale and has a lower load on the system bus.

What are symmetric multiprocessors?

- shared-memory systems connecting identical CPUs


- hyperthreading is "logical" SMP


- Coarse-grained parallelism and tightly coupled


- One operating system, shared main memory and IO

What is a Massively parallel processor?

- distributed memory


- each CPU has own OS


- a network connecting a large number of symmetric processors

What is a cluster?

A distributed-memory system that is tightly coupled and provides coarse-grained parallelism. The whole cluster works on a single problem, and there is central control of the hardware.

What is a computing grid?

A loosely coupled, distributed-memory system providing coarse-grained parallelism. There is no central control of the hardware.

What is the diameter of an interconnection network?

The number of connections on the shortest path between the two farthest nodes.

How many connections are there in a fully-connected network? A ring network?

Fully connected: N(N - 1)/2 connections. Higher cost.


Ring: N connections. Lower cost.
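Worked example: a fully connected network of N = 8 nodes needs 8*7/2 = 28 connections, while a ring of 8 nodes needs only 8.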

Why might one create a logical network that sits on top of a physical network that is organized differently?

Physical connections use one topology, but the solution algorithm is designed for a different topology.

How should interconnection networks be compared?

Using the total number of connections, the degree of the nodes, and the diameter of the network.

What three design choices exist when designing a multiprocessing system?

- Interconnection network topology


- Interprocessor communication routing and format


- Software design

How can the time to transfer a message be calculated?

Tmessage = Tstartup + Tword*L




Tstartup is the time to acquire access to the network


Tword is the time to transfer each data word


L is the length of the message in words




If there are lots of short messages, the startup time dominates. For a few long messages, the word time dominates.
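Worked example (illustrative numbers): with Tstartup = 10 microseconds and Tword = 0.1 microseconds, a 100-word message takes 10 + 0.1*100 = 20 microseconds, so startup accounts for half the total.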

How does the complexity of message routing relate to the network topology?

Routing decision complexity at each node is related to the degree of the node. Static routing reduces the overhead of this routing. Dynamic routing can detect failures and avoid network congestion.

What is the speedup of a multiprocessor system of N processors?

Speedup = T_single / T_N. Theoretically, the maximum speedup is N.

What is the efficiency of a multiprocessor system of N processors?

Efficiency = Speedup / N
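Worked example: if 8 processors give a speedup of 6, efficiency = 6/8 = 75%.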

What is Amdahl's law?

Only a portion of the problem can be solved in parallel, some other portions must be executed sequentially.




speedup = Ts / ((1 - f)*Ts + f*Ts/N) = 1 / ((1 - f) + f/N), where f is the fraction of the problem that can be solved in parallel




In practice, speedup is often worse than this due to communication overhead, scheduling, I/O, load balancing, etc.
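A minimal Python sketch of the formula, assuming an illustrative f = 0.9 (90% of the work parallelizable):

    # Amdahl's law: speedup with n processors when fraction f is parallelizable.
    def amdahl(f, n):
        return 1 / ((1 - f) + f / n)

    for n in (2, 10, 100, 10_000):
        print(n, round(amdahl(0.9, n), 2))   # 1.82, 5.26, 9.17, 9.99
    # speedup is capped at 1/(1 - f) = 10, no matter how many processors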

What is the Amdahl effect?

As the size of the problem increases, a larger fraction of the problem can be solved in parallel.

What is a scalable algorithm?

The efficiency can be maintained as problem size is increased and processors are added.

What are characteristics of a shared-bus interconnection network?

- centralized control of centralized resources (Bus arbitration is required)


- tightly coupled and transaction-oriented; there is latency while waiting for the bus


- Not scalable beyond 4 cores.

What are the characteristics of a crossbar interconnection network?

N processors are connected to M memories.


Crossbar has multiple independent, parallel connections. It is fast and scalable.




Crossbar is costly due to the high number of connections required




Delays are possible because 2 processors may want to access the same memory simultaneously. Latency is normally better than with a system bus.

What are the characteristics of a ring interconnection network?

- Simple routing, wiring is short.


- The average hop count is high, scalability is limited



What are the characteristics of a 2D mesh interconnection network?

A grid of routers, each router connected to neighbouring routers and a core with NIC. Each core may have a cache.

What is NUMA, what problems does it introduce?

NUMA is a design in which the access time to memory depends on a core's location relative to the memory interface. Caches and memory are viewed as a network of shared memory. A NUMA manager maintains a map of the contents of all caches. If a block is not in the local cache, the NUMA manager can retrieve the nearest copy.

How does cache-coherent NUMA work?

Each node's cache has its own NUMA controller. All of the NUMA controllers are then interconnected.

What are additional concerns for heterogeneous multicore systems?

The operating system must schedule tasks according to their affinity attribute, which identifies which processor is specialized for a specific piece of software.

What is a System on a Chip?

A complete computer system on a single chip. It has at least one processor, memory, and I/O.

What are the advantages and disadvantages of an SoC vs a Discrete Component System?

SoC


-reduced manufacturing costs


-higher reliability due to fewer solder points


-reduced physical footprint


-reduced power requirements




Discrete Components


- more customizable; an SoC's configuration is fixed at manufacturing time

What is JTAG?

JTAG allows connection to debugging circuitry

What are some examples of low power states?

Idle: stop executing instructions until next interrupt




Sleep: stop processor until interrupt from a subset of sources




Deep Sleep: stop processor and turn off all peripherals except RTC

What are the characteristics of RISC and CISC and how do they differ?

RISC


- small number of most useful instructions


- most frequently used instructions are optimized, all others eliminated


- fixed instruction size


- fixed instruction format


- addressing modes are restricted




CISC
- large number of specialized instructions


- variable instruction size


- variable number of cycles for instructions




Differences


- CPI for RISC is lower than CPI for CISC


- RISC instruction size is fixed, CISC is variable


- Instructions per program for RISC is high, CISC is low


- RISC has fewer instructions and variations on instructions

What are the relative advantages of CISC and RISC?

CISC


- gives a programmer many options with specialized instructions.


- more work for compilers: knowledge of the specialized instructions and when to use them optimally is required


- The wide variability in instructions complicates the hardware




RISC


- easier to compile and optimize for, simplifies compilers


- microarchitecture can be generalized due to simplified implementation


- more instructions per program, which need to be executed faster



Explain how a pipelined processor is different from a non-pipelined processor.

A pipelined processor works on many instructions in parallel, each at a different stage of execution, whereas a non-pipelined processor executes one complete instruction at a time.

Explain how the classic 5-stage pipeline processor works?

Stages


1. IF - Instruction is fetched from instruction memory


2. ID - Instruction is decoded; registers and operands are obtained


3. EX - ALU operations performed


4. MEM - data is read from or written to memory


5. WB - registers are updated

Describe the information passed to each stage of a classic 5-stage pipeline for the instruction MOV R5, #10

ID: The fetched instruction passed in Instruction Register IR


EX: R5 passed to Dest register, value #10 passed to source operand


MEM: Dest reg, Source operands, WB operand


WB: Dest reg, Source operands, WB operand

Describe briefly why the ideal speedup is not a fair comparison of pipelined and non-pipelined processors.

In an un-pipelined processor, different instructions may have different execution times. The ideal speedup also assumes that all stages in the pipeline take the same amount of time, which is probably not true, and it does not take into account the stalls that occur in pipelines.

Explain why the ideal performance of an M-stage pipeline processor is not realized in reality.

- Ideal speedup assumes a long, uninterrupted sequence of instructions, which is unrealistic


- Requires each stage to take the same amount of time, which is again unrealistic


- Pipeline stalls may occur due to cache misses, instruction dependencies, and flushes due to branching

What is a data dependency between instructions and why is it relevant to a pipeline? Describe the corresponding stall for a classic 5-stage pipeline

A data dependency exists when a result produced by one instruction is used as an operand by a following instruction. The second instruction must wait for the first instruction to complete before it can execute.




When a data dependency exists in the classic 5-stage pipeline, the dependent instruction must stall in ID stage until the first instruction passes the writeback (WB) stage, at which point it may move to the EX stage

What is a control dependency between instructions and why is it relevant to a pipeline? Describe the corresponding stall for a classic 5-stage pipeline.

A control dependency is when a branch must be executed in order to determine the address of the next instruction to fetch. This is relevant to a pipeline because while the branch instruction is in the pipeline, many more instructions may have been fetched and exist in the pipeline, which are not supposed to be executed.




For the 5-stage pipeline, this means that instructions fetched after the branch must be flushed from the pipeline if the branch is taken (replaced with nops)

What is a pre-fetch queue and how can it help to improve the performance of a pipeline?

A prefetch queue is a buffer of instructions that sits between the Instruction Fetch (IF) stage and the Instruction Decode (ID) stage. When there is a stall after the ID stage in the pipeline, instructions can continue to be fetched into the queue. Then, if there is ever a stall reading instructions (cache miss, etc.), the prefetched instructions in the buffer can continue feeding the ID stage.

What is a nop bubble in a pipeline? Why do they sometimes occur?

A nop bubble is a series of nop instructions in the pipeline. No instruction-related activities occur during a stage holding a nop. Bubbles can occur when operations are stalled due to data dependencies, or when the pipeline is flushed due to branching.

Are the dependencies that result in pipeline stalls due to hardware or software attributes?

Dependencies are due to software attributes. Instruction sequencing causes dependency problems, which is a software issue

Why does a pipeline represent a step forward in instruction level parallelism?

A pipeline allows processing of multiple instructions in parallel.

Why are the branch penalties different for conditional vs unconditional branches?

For an unconditional branch, the branch penalty is always the number of instructions after the branch that have been fetched. A conditional branch may not always be taken, so there is the possibility of no branch penalty for a conditional branch.

Describe 3 different ways that branch penalties can be reduced.

- Even earlier execution of branches - resolve the branch in the ID stage. This allows an earlier pipeline flush, and so a smaller branch penalty.




- Delayed branch - make the number of delay slots equal to the number of pipeline stages before the branch is resolved, and fill them with useful instructions.




- Branch prediction - Assume that the branch condition is true or false, start fetching instructions based on the assumption



What is the difference between a static and dynamic branch predictor?

A dynamic branch predictor uses the history for an instruction to guess whether the branch will be taken or not. A static branch predictor does not.

What problem does speculative execution introduce? Why was it possible (over time) to address this problem and enable speculative execution?

Speculative execution can cause instructions with unwanted side effects to be executed. Over time, growing transistor budgets made it possible to add "shadow registers" that hold speculative results until the branch outcome is known, addressing the problem.

Explain why using a pipeline improves performance but complicates predicting numerical performance metrics. What is used to compensate for this complication and give useful metrics?

Performance of pipelines is highly dependent on instruction sequencing due to dependencies and branching. Thus the performance varies greatly based on the specific program and compiler used. To compensate, report the performance for multiple benchmark programs.