60 Cards in this Set

  • Front
  • Back

SISD

von Neumann architecture

MISD

Introduced fault tolerance

SIMD

Early parallelism


Vector processing


Takes advantage of data parallelism


Synchronous


Arch. for GPUs



Speedup

SU(p) = T1/Tp


Measures how much we're speeding up program execution with parallelization

Linear speedup

SU(p) = p


Very unlikely, 'perfect' scalability

Efficiency

Ef(p) = SU(p)/p


Measures how well we're utilizing hardware resources

How do we represent a computation?

Directed Acyclic Graph (DAG)

Amdahl's Law

Speedup will never be perfect



There will always be part of a program that must be serial



Maximum expected improvement based on the # of processors



T1 = W/r



Tp = (fW/r) + ((1 - f)W)/(pr)
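
A minimal sketch in plain C (hypothetical serial fraction of 10%) of how these expressions reduce to the usual Amdahl bound SU(p) = 1 / (f + (1 - f)/p), with efficiency alongside:

    #include <stdio.h>

    /* Amdahl's Law: speedup with serial fraction f on p processors.
       SU(p) = T1/Tp = 1 / (f + (1 - f)/p)  (the W/r terms cancel). */
    double amdahl_speedup(double f, int p) {
        return 1.0 / (f + (1.0 - f) / p);
    }

    int main(void) {
        double f = 0.10;   /* hypothetical: 10% of the work is serial */
        for (int p = 1; p <= 64; p *= 2)
            printf("p=%2d  SU=%.2f  Ef=%.2f\n",
                   p, amdahl_speedup(f, p), amdahl_speedup(f, p) / p);
        return 0;
    }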


True Error

True Error = True Value - Approximate Value

Relative True Error

Relative True Error = True Error / True Value

Approximate Error

Approximate Error = Present Approximation - Previous Approximation

Relative Approximate Error

Relative Approximate Error = Approximate Error / Present Approximation
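
A small worked sketch in C of the four error definitions above, using hypothetical values (22/7 as the present approximation of pi, 3.0 as the previous one):

    #include <stdio.h>

    int main(void) {
        double true_value  = 3.14159265358979;   /* "true" value */
        double prev_approx = 3.0;                /* previous approximation */
        double approx      = 22.0 / 7.0;         /* present approximation */

        double true_err       = true_value - approx;
        double rel_true_err   = true_err / true_value;
        double approx_err     = approx - prev_approx;
        double rel_approx_err = approx_err / approx;

        printf("true error             = %g\n", true_err);
        printf("relative true error    = %g\n", rel_true_err);
        printf("approximate error      = %g\n", approx_err);
        printf("relative approx. error = %g\n", rel_approx_err);
        return 0;
    }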


Round Off Error

Caused by the computer's ability to compute with only a certain (finite) precision

Truncation Error

Caused by the programmer approximating a number or truncating a result

Foster's Methodology

Process to turn a serial program into a parallel one



Partitioning - Divide serial program into small tasks


Communication - Identify communication needs between tasks


Aggregation - Combine tasks where it makes sense, minimize communication


Mapping - Map tasks to processes/threads and share the load



Iterative process until we find the best design

Gustafson-Barsis Law

Serial time stays fixed, parallel time scales inversely with n



Scaled Speedup: SU(p) = p - s(p - 1), where s is the serial fraction of the run time




MPI_Init

Tells MPI to do the necessary setup; called before any other MPI function



Should only be called once

MPI_Finalize

Cleans up resources allocated for the program, called after all MPI functions are called



Should only be called once

MPI_COMM_WORLD

MPI communicator, collection of processes that can send messages to each other

MPI_Comm_size

MPI_Comm_size(MPI_COMM_WORLD, &num_of_processes)



Used to get the number of processes our program is running

MPI_Comm_rank

MPI_Comm_rank(MPI_COMM_WORLD, &my_rank)



Gets the rank of the process and sticks it in my_rank
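
A minimal MPI skeleton (C) tying MPI_Init, MPI_Comm_size, MPI_Comm_rank, and MPI_Finalize together; the printed message is just an illustration:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[]) {
        int comm_sz, my_rank;

        MPI_Init(&argc, &argv);                    /* set up MPI (once) */
        MPI_Comm_size(MPI_COMM_WORLD, &comm_sz);   /* number of processes */
        MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);   /* this process's rank */

        printf("Hello from rank %d of %d\n", my_rank, comm_sz);

        MPI_Finalize();                            /* clean up (once) */
        return 0;
    }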

MPI_Send

MPI_Send(message_buffer, message_size, message_type, destination, tag, communicator)

MPI_Recv

MPI_Recv(message_buffer, message_size, message_type, source, tag, communicator, status)

MPI_Send & MPI_Recv message matching

Tags & communicators must match on MPI_Send/MPI_Recv



MPI_Recv always blocks until a matching message is received
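
A minimal point-to-point sketch (C): rank 1 sends one double to rank 0; the tag and communicator must match, and MPI_Recv blocks until the matching message arrives:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[]) {
        int my_rank;
        double x = 0.0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

        if (my_rank == 1) {
            x = 42.0;
            MPI_Send(&x, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
        } else if (my_rank == 0) {
            MPI_Recv(&x, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 0 received %f\n", x);
        }

        MPI_Finalize();
        return 0;
    }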

What does OpenMPI stand for?

Open Message Passing Interface

Scalable Parallel System

Efficiency can be kept constant as the number of processors increases

Strongly Scalable System

Keep efficiency fixed, add processors/threads without increasing the problem size

Weakly Scalable System

Keep efficiency fixed, increase problem size as we increase the number of processors/threads

IEEE Floating Point Standard

32 bits



Bit 31 is the sign bit


Bits 30-23 are the exponent; the exponent affects range


Bits 22-0 are the mantissa; the mantissa affects precision
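
A short C sketch that pulls the three fields out of a 32-bit float's bit pattern, assuming the single-precision layout described above (the value -6.25f is just a hypothetical number to inspect):

    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>

    int main(void) {
        float f = -6.25f;                /* hypothetical value to inspect */
        uint32_t bits;
        memcpy(&bits, &f, sizeof bits);  /* reinterpret the float's bits */

        uint32_t sign     = bits >> 31;            /* bit 31 */
        uint32_t exponent = (bits >> 23) & 0xFF;   /* bits 30-23 */
        uint32_t mantissa = bits & 0x7FFFFF;       /* bits 22-0 */

        printf("sign=%u exponent=%u (unbiased %d) mantissa=0x%06X\n",
               (unsigned)sign, (unsigned)exponent,
               (int)exponent - 127, (unsigned)mantissa);
        return 0;
    }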

Given an OpenMPI program running 4 processes, draw a tree-structured summation diagram.

0   1   2   3
|  /    |  /
0       2
 \     /
  \   /
    0

(1 sends to 0 and 3 sends to 2; then 2 sends its partial sum to 0, which holds the total)



Point to Point Communications

MPI_Send & MPI_Recv


Matched on tags and communicators

Collective Communication

MPI_Reduce, MPI_Allreduce, MPI_Bcast, MPI_Scatter, MPI_Gather, MPI_Allgather, MPI_Barrier



All processes must call the same collective function, or processes will crash or hang



No tags, matched on communicators + order in which they're called




Arguments passed must be "compatible", or else more hanging and crashing
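
A small collective sketch (C): every rank contributes a local value, MPI_Reduce sums them onto rank 0, and MPI_Bcast then shares the total with everyone (MPI_Allreduce would do both in one call); the local values are hypothetical:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[]) {
        int my_rank;
        double local, total = 0.0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

        local = my_rank + 1.0;   /* hypothetical local result */

        /* every process must make the same collective calls, in the same order */
        MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        MPI_Bcast(&total, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        printf("rank %d sees total = %f\n", my_rank, total);

        MPI_Finalize();
        return 0;
    }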


Denormalized

An exponent field of all zeros marks a denormalized value



All-zero or all-one exponent fields denote special values:


Zero


denormalized


infinity


NaN

Round Off Error

Determined by number of bits in mantissa

Loss of Precision

When adding two floating point numbers, the operands must be shifted to align the radix points



The number with the smaller exponent is denormalized to match the larger exponent

Add these two numbers, floating point style, show loss of precision:


1.250 x 10^2


2.000 x 10^-2

1.250|0 x 10^2


0.000|2 x 10^2



The digits past the "|" fall outside the 4-digit mantissa, so 125.0 + 0.02 = 125.0
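
The same effect in C (an analogous illustration with different numbers, not the exact decimal example above): in single precision, adding a value that falls below the larger operand's last mantissa bit leaves it unchanged:

    #include <stdio.h>

    int main(void) {
        float big   = 1.0e8f;   /* float carries only ~7 significant decimal digits */
        float small = 1.0f;     /* falls below the last mantissa bit of 'big' */
        float sum   = big + small;

        printf("big + small == big ? %s\n", (sum == big) ? "yes" : "no");
        return 0;
    }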

Machine Epsilon

Defined as a measure of floating point accuracy


Found as the difference between 1 and the next number that can be represented




32 bits for single precision

Value = (-1)^(sign bit) * (1.mantissa) * 2^(exponent - 127)
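
A common sketch (C) for finding machine epsilon by halving until 1 + eps is no longer distinguishable from 1; FLT_EPSILON from <float.h> gives the standard single-precision value for comparison:

    #include <stdio.h>
    #include <float.h>

    int main(void) {
        float eps = 1.0f;
        /* keep halving while 1 + eps/2 is still distinguishable from 1 */
        while ((float)(1.0f + eps / 2.0f) > 1.0f)
            eps /= 2.0f;

        printf("computed epsilon = %g\n", eps);        /* ~1.19e-07 */
        printf("FLT_EPSILON      = %g\n", FLT_EPSILON);
        return 0;
    }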

Positive Overflow

Trying to represent a number larger than the maximum positive representable value

Negative overflow

Trying to represent a negative number whose magnitude exceeds that of the most negative representable value

Positive Underflow

Trying to represent a number smaller than the minimum positive representable value

Negative Underflow

Trying to represent a negative number whose magnitude is smaller than that of the smallest (closest to zero) representable negative value

Subtract these two numbers, floating point style, show loss of precision:


125.68


125.31

1.256 x 10^2


1.253 x 10^2



0.003 x 10^2



0.3 (the exact answer is 0.37; the trailing digits were lost when the operands were stored with only 4 digits)

Block Partitioning

Assign blocks of consecutive components to each process

Cyclic Partitioning

Assign components in a round robin fashion

Block-Cyclic Partitioning

Use cyclic distribution of blocks of components

Partition the following array in a block manner, cyclic manner, and block-cyclic manner across 2 processors. For block-cyclic, use a block size of 2



array = [21, 22, 23, 24]


Block partitioning


0 | 21, 22


1 | 23, 24



Cyclic Partitioning


0 | 21, 23


1 | 22, 24



Block-Cyclic Partitioning


0 | 21, 22


1 | 23, 24
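
One way to express the three distributions as owner formulas (a C sketch; it assumes n is divisible by p, and by p*b for block-cyclic):

    #include <stdio.h>

    /* Which process owns element i under each distribution? */
    int block_owner(int i, int n, int p)        { return i / (n / p); }
    int cyclic_owner(int i, int p)              { return i % p; }
    int block_cyclic_owner(int i, int b, int p) { return (i / b) % p; }

    int main(void) {
        int a[] = {21, 22, 23, 24};
        int n = 4, p = 2, b = 2;
        for (int i = 0; i < n; i++)
            printf("%d: block->P%d  cyclic->P%d  block-cyclic->P%d\n",
                   a[i], block_owner(i, n, p), cyclic_owner(i, p),
                   block_cyclic_owner(i, b, p));
        return 0;
    }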

MPI_Scatter

Used to distribute data to processes from an array or vector



Divides data into pieces



First piece goes to process 0



Performs a BLOCK DISTRIBUTION

MPI_Gather

Collects array data from different processes



Receives from a block distribution

MPI_Allgather

Both gathers and distributes data (butterfly structure)
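
A sketch (C) of a block distribution with MPI_Scatter and the matching MPI_Gather; the global vector length n = 8 is hypothetical and assumed divisible by the number of processes:

    #include <stdio.h>
    #include <stdlib.h>
    #include <mpi.h>

    int main(int argc, char *argv[]) {
        int comm_sz, my_rank;
        const int n = 8;                 /* hypothetical global vector length */

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &comm_sz);
        MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

        int local_n = n / comm_sz;       /* assumes n divisible by comm_sz */
        double *vec = NULL, *local = malloc(local_n * sizeof(double));

        if (my_rank == 0) {              /* only rank 0 holds the full vector */
            vec = malloc(n * sizeof(double));
            for (int i = 0; i < n; i++) vec[i] = i;
        }

        /* block distribution: first piece to rank 0, next to rank 1, ... */
        MPI_Scatter(vec, local_n, MPI_DOUBLE, local, local_n, MPI_DOUBLE,
                    0, MPI_COMM_WORLD);

        for (int i = 0; i < local_n; i++) local[i] *= 2.0;   /* local work */

        /* collect the blocks back onto rank 0, in rank order */
        MPI_Gather(local, local_n, MPI_DOUBLE, vec, local_n, MPI_DOUBLE,
                   0, MPI_COMM_WORLD);

        if (my_rank == 0) {
            for (int i = 0; i < n; i++) printf("%g ", vec[i]);
            printf("\n");
            free(vec);
        }
        free(local);
        MPI_Finalize();
        return 0;
    }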



Async Communication

Non-blocking


All MPI communication calls have an asynchronous counterpart (e.g., MPI_Isend / MPI_Irecv)
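
A small non-blocking sketch (C) using MPI_Isend / MPI_Irecv, the asynchronous counterparts of MPI_Send / MPI_Recv, with MPI_Wait to complete the requests:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[]) {
        int my_rank;
        double out, in = 0.0;
        MPI_Request req;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
        out = my_rank;

        if (my_rank == 0) {
            MPI_Isend(&out, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &req);
            /* ... could overlap computation here ... */
            MPI_Wait(&req, MPI_STATUS_IGNORE);
        } else if (my_rank == 1) {
            MPI_Irecv(&in, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &req);
            /* ... could overlap computation here ... */
            MPI_Wait(&req, MPI_STATUS_IGNORE);
            printf("rank 1 received %f\n", in);
        }

        MPI_Finalize();
        return 0;
    }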




Performance evaluation

Elapsed Time (wall-clock time), not CPU or system time



We want idle process time included if it's part of the parallel overhead



No I/O

How do we get timing information with OpenMPI?

MPI_Wtime - returns elapsed time in seconds
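
The usual timing pattern (C sketch): a barrier so all processes start together, MPI_Wtime before and after the timed region, and a reduction to report the slowest process's elapsed time:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[]) {
        int my_rank;
        double start, finish, local_elapsed, elapsed;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

        MPI_Barrier(MPI_COMM_WORLD);   /* start everyone together */
        start = MPI_Wtime();

        /* ... parallel work being timed (no I/O) ... */

        finish = MPI_Wtime();
        local_elapsed = finish - start;

        /* the run is only as fast as its slowest process */
        MPI_Reduce(&local_elapsed, &elapsed, 1, MPI_DOUBLE, MPI_MAX,
                   0, MPI_COMM_WORLD);
        if (my_rank == 0)
            printf("Elapsed time = %e seconds\n", elapsed);

        MPI_Finalize();
        return 0;
    }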


Give good and bad examples of data distribution.

BAD - loop 10000 times calling MPI_Send with one element at a time



GOOD - call MPI_Send once with all 10000 elements

What is BLAS?

BLAS stands for Basic Linear Algebra Subprograms


It is a standardized library of linear algebra routines


Give the different BLAS level standards

Level 1 = vector to vector routines


Level 2 = Matrix to vector routines


Level 3 = Matrix to Matrix routines
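
A sketch of one representative call per level through the CBLAS C interface (cblas_ddot, cblas_dgemv, cblas_dgemm are the standard C bindings; the exact header and link flags depend on the BLAS implementation installed):

    #include <cblas.h>

    void blas_levels_example(int n,
                             const double *x, const double *y,   /* vectors */
                             const double *A, const double *B,   /* n x n matrices */
                             double *v, double *C) {
        /* Level 1: vector-vector, e.g. dot product x . y */
        double d = cblas_ddot(n, x, 1, y, 1);
        (void)d;

        /* Level 2: matrix-vector, v = 1.0*A*x + 0.0*v */
        cblas_dgemv(CblasRowMajor, CblasNoTrans, n, n,
                    1.0, A, n, x, 1, 0.0, v, 1);

        /* Level 3: matrix-matrix, C = 1.0*A*B + 0.0*C */
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, n, n, n,
                    1.0, A, n, B, n, 0.0, C, n);
    }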

What is Odd Even Transposition Sort? List out what odd and even phases look like with an array.

Variant of bubble sort, parallelized



Even phase:


(a[0], a[1]), (a[2], a[3])



Odd phase:


(a[1], a[2]), (a[3], a[4])



Assign a block of keys to each process


Sort keys


Have two processes exchange all keys



We might hang in deadlock if partner processes both block in sends; pairing the exchange safely (e.g., with MPI_Sendrecv) avoids this
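
A serial version of the odd and even phases (C sketch); in the parallel version each compare-swap becomes a key exchange between neighboring processes:

    #include <stdio.h>

    static void swap(int *a, int *b) { int t = *a; *a = *b; *b = t; }

    /* Serial odd-even transposition sort: n phases of pairwise compare-swaps. */
    void odd_even_sort(int a[], int n) {
        for (int phase = 0; phase < n; phase++) {
            if (phase % 2 == 0) {            /* even phase: (a[0],a[1]), (a[2],a[3]), ... */
                for (int i = 1; i < n; i += 2)
                    if (a[i - 1] > a[i]) swap(&a[i - 1], &a[i]);
            } else {                         /* odd phase: (a[1],a[2]), (a[3],a[4]), ... */
                for (int i = 1; i < n - 1; i += 2)
                    if (a[i] > a[i + 1]) swap(&a[i], &a[i + 1]);
            }
        }
    }

    int main(void) {
        int a[] = {5, 9, 4, 3, 7, 1};
        int n = sizeof a / sizeof a[0];
        odd_even_sort(a, n);
        for (int i = 0; i < n; i++) printf("%d ", a[i]);
        printf("\n");
        return 0;
    }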






IJK forms

Inner product model:


ijk and jik



Middle product model:


ikj and jki



Outer product model:


kij and kji
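
The inner product (ijk) form as a baseline C loop nest; the other orderings permute the three loops while keeping the same update C[i][j] += A[i][k] * B[k][j]:

    /* Inner product (ijk) form of C = A * B for n x n matrices.
       The innermost loop computes a dot product of row i of A and column j of B. */
    void matmul_ijk(int n, const double *A, const double *B, double *C) {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                double sum = 0.0;
                for (int k = 0; k < n; k++)
                    sum += A[i * n + k] * B[k * n + j];
                C[i * n + j] = sum;
            }
    }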



SPMD

Single program multiple data


A single program contains all the code for all tasks


Can implement both task and data parallelism
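
A tiny SPMD sketch (C with MPI): every rank runs the same program, and branching on the rank gives different ranks different tasks over different data; the printed roles are illustrative:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[]) {
        int my_rank, comm_sz;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
        MPI_Comm_size(MPI_COMM_WORLD, &comm_sz);

        if (my_rank == 0)
            printf("rank 0: coordinating (task parallelism)\n");
        else
            printf("rank %d of %d: working on its slice of the data"
                   " (data parallelism)\n", my_rank, comm_sz);

        MPI_Finalize();
        return 0;
    }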

Load balancing

Each process and thread gets the same amount of work


No execution unit should be idle



Minimize synchronization & communication overhead