60 Cards in this Set
- Front
- Back
SISD |
von Neumann architecture |
|
MISD |
Introduced fault tolerance |
|
SIMD |
Early parallelism
Vector processing
Takes advantage of data parallelism
Synchronous
Architecture for GPUs
|
|
Speedup |
SU(p) = T1/Tp Measures how much we're speeding up program execution with parallelization |
|
Linear speedup |
SU(p) = p Very unlikely, 'perfect' scalability |
|
Efficiency |
Ef(p) = SU(p)/p Measures how well we're utilizing hardware resources |
|
How do we represent a computation? |
Directed Acyclic Graph (DAG) |
|
Amdahl's Law |
Speedup will never be perfect
There will always be part of a program that must be serial
Maximum expected improvement based on the # of processors
T1 = W/r
Tp = (fW/r) + ((1 - f)W/pr)
|
|
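Plugging Tp into SU(p) = T1/Tp simplifies to SU(p) = 1/(f + (1 - f)/p), where f is the serial fraction. A small illustrative sketch in C (function name is my own):

    #include <stdio.h>

    /* Amdahl bound: SU(p) = T1/Tp = 1 / (f + (1 - f)/p), f = serial fraction */
    static double amdahl_speedup(double f, int p) {
        return 1.0 / (f + (1.0 - f) / p);
    }

    int main(void) {
        /* Even with 1024 processors, a 5% serial fraction caps speedup near 20. */
        printf("f = 0.05, p = 1024: SU = %.2f\n", amdahl_speedup(0.05, 1024));
        return 0;
    }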
True Error |
True Error = True Value - Approximate Value |
|
Relative True Error |
Relative True Error = True Error / True Value |
|
Approximate Error |
Approximate Error = Present Approximation - Previous Approximation |
|
Relative Approximate Error |
Relative Approximate Error = Approximate Error / Present Approximation
|
|
Round Off Error |
Caused by the computer's ability to compute to only a certain precision |
|
Truncation Error |
Caused by the programmer approximating a number or truncating a result |
|
Foster's Methodology |
Process for turning a serial program into a parallel one
Partitioning - divide the serial program into small tasks
Communication - identify communication needs between tasks
Aggregation - combine tasks where it makes sense, minimizing communication
Mapping - map tasks to processes/threads and share the load
Iterative process until we find the best design |
|
Gustafson-Barsis Law |
Serial time stays fixed, parallel time scales inversely with n
Scaled Speedup
|
|
MPI_Init |
Tells MPI to do the necessary setup; called before any other MPI function
Should only be called once |
|
MPI_Finalize |
Cleans up resources allocated for the program, called after all MPI functions are called
Should only be called once |
|
MPI_COMM_WORLD |
MPI communicator, collection of processes that can send messages to each other |
|
MPI_Comm_size |
MPI_Comm_size(MPI_COMM_WORLD, &num_of_processes)
Gets the number of processes in the communicator and stores it in num_of_processes
|
|
MPI_Comm_rank |
MPI_Comm_rank(MPI_COMM_WORLD, &my_rank)
Gets the rank of the calling process and stores it in my_rank |
|
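A minimal runnable sketch tying MPI_Init, MPI_Comm_size, MPI_Comm_rank, and MPI_Finalize together (compile with mpicc, launch with mpiexec):

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[]) {
        int num_of_processes, my_rank;

        MPI_Init(&argc, &argv);                            /* setup; call once */
        MPI_Comm_size(MPI_COMM_WORLD, &num_of_processes);  /* how many processes */
        MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);           /* which one am I */

        printf("Hello from rank %d of %d\n", my_rank, num_of_processes);

        MPI_Finalize();                                    /* cleanup; call once */
        return 0;
    }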
MPI_Send |
MPI_Send(message_buffer, message_size, message_type, destination, tag, communicator) |
|
MPI_Recv |
MPI_Recv(message_buffer, message_size, message_type, source, tag, communicator, status) |
|
MPI_Send & MPI_Recv message matching |
Tags & communicators must match on MPI_Send/MPI_Recv
MPI_Recv always blocks until a matching message is received |
|
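A small point-to-point sketch, assuming the program runs with at least two processes: rank 1 sends an int to rank 0 with tag 0:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[]) {
        int my_rank, value;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

        if (my_rank == 1) {
            value = 42;
            /* buffer, count, type, destination, tag, communicator */
            MPI_Send(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
        } else if (my_rank == 0) {
            /* buffer, count, type, source, tag, communicator, status */
            MPI_Recv(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 0 received %d\n", value);  /* blocked until matched */
        }

        MPI_Finalize();
        return 0;
    }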
What does OpenMPI stand for? |
Open Message Passing Interface |
|
Scalable Parallel System |
Efficiency can be kept constant as the number of processors increases |
|
Strongly Scalable System |
Keep efficiency fixed, add processors/threads without increasing the problem size |
|
Weakly Scalable System |
Keep efficiency fixed, increase problem size as we increase the number of processors/threads |
|
IEEE Floating Point Standard |
32 bits
Bit 31 is the sign bit
Bits 30-23 are the exponent; the exponent affects range
Bits 22-0 are the mantissa; the mantissa affects precision |
|
Given an OpenMPI program running 4 processes, draw a tree-structured summation diagram. |
0   1   2   3
| /     | /
0       2
|     /
|  /
0
|
|
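One way to code that tree with point-to-point calls; a sketch where each process contributes rank + 1 and partial sums flow up the tree to rank 0:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[]) {
        int rank, size, recv_val;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int partial = rank + 1;  /* total should be size*(size+1)/2 */

        /* In step s, a rank whose quotient rank/s is odd sends its partial
         * sum to the partner s ranks below it and drops out of the tree. */
        for (int s = 1; s < size; s *= 2) {
            if ((rank / s) % 2 == 1) {
                MPI_Send(&partial, 1, MPI_INT, rank - s, 0, MPI_COMM_WORLD);
                break;
            } else if (rank + s < size) {
                MPI_Recv(&recv_val, 1, MPI_INT, rank + s, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                partial += recv_val;
            }
        }
        if (rank == 0) printf("tree sum = %d\n", partial);

        MPI_Finalize();
        return 0;
    }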
Point to Point Communications |
MPI_Send & MPI_Recv Matched on tags and communicators |
|
Collective Communication |
MPI_Reduce, MPI_Allreduce, MPI_Bcast, MPI_Scatter, MPI_Gather, MPI_Allgather, MPI_Barrier
All processes must call the same collective function, or else processes will crash or hang
No tags, matched on communicators + order in which they're called
Arguments passed must be "compatible", or else more hanging and crashing
|
|
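A short sketch of two of those collectives; every rank makes the same calls in the same order (MPI_Reduce followed by MPI_Bcast here does what one MPI_Allreduce would):

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[]) {
        int rank, local, total = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        local = rank + 1;

        /* Sum every rank's local value onto rank 0 (tree sum, built in). */
        MPI_Reduce(&local, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

        /* Broadcast the result from rank 0 back to everyone. */
        MPI_Bcast(&total, 1, MPI_INT, 0, MPI_COMM_WORLD);

        printf("rank %d sees total %d\n", rank, total);
        MPI_Finalize();
        return 0;
    }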
Denormalized |
Floating point value with an all-zero exponent field
Exponent fields of all zeros or all ones denote special values: zero, denormalized, infinity, NaN |
|
Round Off Error |
Determined by number of bits in mantissa |
|
Loss of Precision |
Adding two floating point numbers, must shift to align the radix points
Number with the smaller exponent is denormalized to match the larger exponent |
|
Add these two numbers, floating point style, show loss of precision: 1.250 x 10^2 and 2.000 x 10^-2 |
1.250|0 x 10^2
0.000|2 x 10^2 (digits right of the | fall outside the 4-digit mantissa)
125.0 + 0.02 = 125.0 |
|
Machine Epsilon |
A measure of floating point accuracy: the difference between 1 and the next number that can be represented
|
|
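A sketch that finds float machine epsilon by halving until 1 + eps rounds back to 1:

    #include <stdio.h>

    int main(void) {
        float eps = 1.0f;
        volatile float sum;          /* volatile: force the test into float precision */
        for (;;) {
            sum = 1.0f + eps / 2.0f;
            if (sum == 1.0f) break;  /* eps/2 no longer registers */
            eps /= 2.0f;
        }
        printf("float machine epsilon ~ %g\n", eps);  /* expect 2^-23 ~ 1.19e-07 */
        return 0;
    }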
32 Bits for Single Precision |
Value = (-1)^signbit * (1.mantissa) * 2^(exponent - 127) |
|
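A sketch that pulls the three fields out of a float's bit pattern and re-applies that formula (link with -lm for ldexp):

    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>
    #include <math.h>

    int main(void) {
        float x = -6.25f;
        uint32_t bits;
        memcpy(&bits, &x, sizeof bits);          /* reinterpret the 32 bits */

        uint32_t sign     = bits >> 31;          /* bit 31 */
        uint32_t exponent = (bits >> 23) & 0xFF; /* bits 30-23, biased by 127 */
        uint32_t mantissa = bits & 0x7FFFFF;     /* bits 22-0 */

        /* Value = (-1)^signbit * (1.mantissa) * 2^(exponent - 127) */
        double value = (sign ? -1.0 : 1.0)
                     * ldexp(1.0 + mantissa / 8388608.0, (int)exponent - 127);
        printf("sign=%u exponent=%u mantissa=0x%06X -> %g\n",
               sign, exponent, mantissa, value);  /* -> -6.25 */
        return 0;
    }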
Positive Overflow |
Trying to represent a number larger than the maximum positive representable value |
|
Negative overflow |
Trying to represent a number more negative than the most negative representable value |
|
Positive Underflow |
Trying to represent a number smaller than the minimum positive representable value |
|
Negative Underflow |
Trying to represent a negative number closer to zero than the smallest-magnitude representable negative value |
|
Subtract these two numbers, floating point style, show loss of precision: 125.68 125.31 |
1.256 x 10^2 - 1.253 x 10^2 (only four mantissa digits survive)
0.003 x 10^2
0.3 (the true difference is 0.37; the leading digits cancelled) |
|
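The same cancellation observed in C; float carries only about 7 decimal digits, so subtracting nearly equal values exposes the rounding in the trailing digits:

    #include <stdio.h>

    int main(void) {
        float a = 125.68f, b = 125.31f;   /* both already rounded in float */
        /* The leading digits cancel, leaving only the rounded tail. */
        printf("float : %.10f\n", a - b);           /* not exactly 0.37 */
        printf("double: %.10f\n", 125.68 - 125.31); /* closer, still inexact */
        return 0;
    }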
Block Partitioning |
Assign blocks of consecutive components to each process |
|
Cyclic Partitioning |
Assign components in a round robin fashion |
|
Block-Cyclic Partitioning |
Use cyclic distribution of blocks of components |
|
Partition the following array in a block manner, cyclic manner, and block-cyclic manner across 2 processors. For block-cyclic use a block size of 2
array = [21, 22, 23, 24]
|
Block partitioning: P0 -> 21, 22; P1 -> 23, 24
Cyclic partitioning: P0 -> 21, 23; P1 -> 22, 24
Block-cyclic partitioning: P0 -> 21, 22; P1 -> 23, 24 |
|
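A sketch of the index-to-owner maps behind those three distributions (helper names are my own; n elements, p processes, block size b, assuming n divides evenly):

    #include <stdio.h>

    /* Which process owns element i under each distribution? */
    int block_owner(int i, int n, int p)        { return i / (n / p); }
    int cyclic_owner(int i, int p)              { return i % p; }
    int block_cyclic_owner(int i, int b, int p) { return (i / b) % p; }

    int main(void) {
        int n = 4, p = 2, b = 2;  /* the array/processor setup from the card */
        for (int i = 0; i < n; i++)
            printf("elem %d: block->P%d  cyclic->P%d  block-cyclic->P%d\n",
                   i, block_owner(i, n, p), cyclic_owner(i, p),
                   block_cyclic_owner(i, b, p));
        return 0;
    }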
MPI_Scatter |
Used to distribute data to processes from an array or vector
Divides data into pieces
First piece goes to process 0
Performs a BLOCK DISTRIBUTION |
|
MPI_Gather |
Collects array data from different processes
Receives from a block distribution |
|
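A sketch pairing the two collectives: rank 0 scatters a vector in blocks, each rank doubles its block, and the blocks are gathered back in rank order:

    #include <stdio.h>
    #include <stdlib.h>
    #include <mpi.h>

    int main(int argc, char *argv[]) {
        int rank, size, local_n = 2;  /* elements per process */
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int *data = NULL, *local = malloc(local_n * sizeof(int));
        if (rank == 0) {              /* only the root holds the full array */
            data = malloc(size * local_n * sizeof(int));
            for (int i = 0; i < size * local_n; i++) data[i] = i;
        }

        /* Block distribution: first block to rank 0, next to rank 1, ... */
        MPI_Scatter(data, local_n, MPI_INT, local, local_n, MPI_INT,
                    0, MPI_COMM_WORLD);
        for (int i = 0; i < local_n; i++) local[i] *= 2;
        /* Collect the blocks back onto rank 0 in rank order. */
        MPI_Gather(local, local_n, MPI_INT, data, local_n, MPI_INT,
                   0, MPI_COMM_WORLD);

        if (rank == 0) {
            for (int i = 0; i < size * local_n; i++) printf("%d ", data[i]);
            printf("\n");
        }
        free(local); free(data);
        MPI_Finalize();
        return 0;
    }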
MPI_Allgather |
Both gathers and distributes data (butterfly structure)
|
|
Async Communication |
Non-blocking; all MPI communication calls have an asynchronous counterpart (e.g., MPI_Isend, MPI_Irecv)
|
|
Performance evaluation |
Elapsed time (wall-clock time), not CPU or system time
We want idle process time included if it's part of the parallel overhead
No I/O |
|
How do we get timing information with OpenMPI? |
MPI_Wtime - returns elapsed time in seconds
|
|
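A typical timing pattern around MPI_Wtime (a sketch): the barrier starts every process together, and a max reduction reports the slowest process, which is the parallel run time:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[]) {
        int rank;
        double start, elapsed, max_elapsed;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Barrier(MPI_COMM_WORLD);     /* line everyone up before timing */
        start = MPI_Wtime();

        /* ... parallel work being measured ... */

        elapsed = MPI_Wtime() - start;   /* wall-clock seconds on this process */
        MPI_Reduce(&elapsed, &max_elapsed, 1, MPI_DOUBLE, MPI_MAX,
                   0, MPI_COMM_WORLD);
        if (rank == 0) printf("elapsed: %f s\n", max_elapsed);

        MPI_Finalize();
        return 0;
    }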
Give good and bad examples of data distribution. |
BAD - loop 10000 times calling MPI_Send with one element at a time
GOOD - call MPI_Send once with all 10000 elements |
|
What is BLAS? |
BLAS stands for Basic Linear Algebra Subprograms. It is a standardized library of linear algebra routines
|
|
Give the different BLAS level standards |
Level 1 = vector-vector routines
Level 2 = matrix-vector routines
Level 3 = matrix-matrix routines |
|
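A sketch calling one routine from each level through the CBLAS C interface (header name and link flag, e.g. -lblas or -lopenblas, vary by install):

    #include <stdio.h>
    #include <cblas.h>

    int main(void) {
        double x[2] = {1, 2}, y[2] = {3, 4};
        double A[4] = {1, 0, 0, 1};              /* 2x2 identity, row-major */
        double B[4] = {5, 6, 7, 8}, C[4] = {0};

        /* Level 1 (vector-vector): dot product */
        double d = cblas_ddot(2, x, 1, y, 1);

        /* Level 2 (matrix-vector): y = 1.0*A*x + 0.0*y */
        cblas_dgemv(CblasRowMajor, CblasNoTrans, 2, 2, 1.0, A, 2, x, 1, 0.0, y, 1);

        /* Level 3 (matrix-matrix): C = 1.0*A*B + 0.0*C */
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    2, 2, 2, 1.0, A, 2, B, 2, 0.0, C, 2);

        printf("dot=%g  y=[%g %g]  C=[%g %g; %g %g]\n",
               d, y[0], y[1], C[0], C[1], C[2], C[3]);
        return 0;
    }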
What is Odd Even Transposition Sort? List out what odd and even phases look like with an array. |
Variant of bubble sort, parallelized
Even phase: (a[0], a[1]), (a[2], a[3])
Odd phase: (a[1], a[2]), (a[3], a[4])
Parallel version: assign a block of keys to each process, sort the local keys, then have paired processes exchange all their keys each phase
Unsafe send/receive ordering between partners can deadlock (MPI_Sendrecv avoids this)
|
|
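A serial sketch of the compare-swap phases (the parallel version turns each compare-swap into a whole-block key exchange between neighboring processes):

    #include <stdio.h>

    static void swap(int *a, int *b) { int t = *a; *a = *b; *b = t; }

    /* n phases, alternating even pairs (0,1),(2,3),... and odd pairs (1,2),(3,4),... */
    void odd_even_sort(int a[], int n) {
        for (int phase = 0; phase < n; phase++)
            for (int i = (phase % 2 == 0) ? 0 : 1; i + 1 < n; i += 2)
                if (a[i] > a[i + 1])
                    swap(&a[i], &a[i + 1]);
    }

    int main(void) {
        int a[] = {9, 4, 7, 1, 3};
        odd_even_sort(a, 5);
        for (int i = 0; i < 5; i++) printf("%d ", a[i]);  /* 1 3 4 7 9 */
        printf("\n");
        return 0;
    }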
IJK forms |
Inner product model: ijk and jik
Middle product model: ikj and jki
Outer product model: kij and kji
|
|
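A sketch of the inner-product (ijk) and outer-product (kij) forms; all six orders compute the same product and differ only in which index is outermost and thus in memory access pattern:

    #include <stdio.h>
    #define N 2

    /* Inner product form (ijk): each C[i][j] is a dot product of row i and column j. */
    void matmul_ijk(double A[N][N], double B[N][N], double C[N][N]) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                for (int k = 0; k < N; k++)
                    C[i][j] += A[i][k] * B[k][j];
    }

    /* Outer product form (kij): each k adds the outer product of column k of A
     * and row k of B into all of C. */
    void matmul_kij(double A[N][N], double B[N][N], double C[N][N]) {
        for (int k = 0; k < N; k++)
            for (int i = 0; i < N; i++)
                for (int j = 0; j < N; j++)
                    C[i][j] += A[i][k] * B[k][j];
    }

    int main(void) {
        double A[N][N] = {{1, 2}, {3, 4}}, B[N][N] = {{5, 6}, {7, 8}};
        double C1[N][N] = {{0}}, C2[N][N] = {{0}};
        matmul_ijk(A, B, C1);
        matmul_kij(A, B, C2);  /* same result, different access pattern */
        printf("C1[1][1]=%g  C2[1][1]=%g\n", C1[1][1], C2[1][1]);  /* 50  50 */
        return 0;
    }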
SPMD |
Single program multiple data; a single program contains all the code for all tasks; can implement task and data parallelism |
|
Load balancing |
Each process/thread gets the same amount of work; no execution unit should be idle
Minimize synchronization & communication overhead