Presentation Transcript
PERFORMANC EVALUATION: PERFORMANC EVALUATION S.Lakshmivarahan
School of Computer Science
University of Oklahoma
HYPE ABOUT SUPER/HYPER: HYPE ABOUT SUPER/HYPER Super star/model
Super market
Super bowl
Super sonic
Super computers
Every computer is a super computer of its time: Every computer is a super computer of its time Drum to core memory
Vacuum tubes to integrated circuits
Pipelined/vector arithmetic unit
Multiple processors - Architecture Technology
Modern era began in mid 1970’s: Modern era began in mid 1970’s CRAY-I – Los Alamos National Lab
Intel Scientific computers in early 1980’s
NCUBE
Alliant, Hitachi
Denelcor
Convex, SGI, IBM, Thinking Machine
Kendall Square
Today: super ~ parallel : Today: super ~ parallel Lot of hype in the in 1980’s
Parallelism will be everywhere and without it you will be in back waters
Analogy with helicopters
This prediction did not materialize: This prediction did not materialize Silent revolution from behind
Thanks to technology
By mid 1990’s workstations as powerful as CRAY-I was available for a fraction of the cost – from millions to a few ten thousands
Desk top computing became a reality
Many vendors went out of business
When the dust settled access to very powerful (CRAY like) desk top became a model for computing
Envelop of problems needing large scale processors was pushed far beyond what was conceived in mid 1980’s: Envelop of problems needing large scale processors was pushed far beyond what was conceived in mid 1980’s Lead to interesting debate in the 1990’s
“Super computing ain’t so super “– IEEE Computer 1994 by Ted Lewis
“Parallel Computing:Glory and Collapse”-IEEE Computer, 1994 by Borko Furht
“Parallel Computing is still not ready for mainstream” CACM, July 1997 by D. Talia
“Parallel Goes Populist” Byte May 1997 by D. Pountain
Our view:: Our view: Parallelism is here to stay, but not everyone needs it as was once thought.
Computing will continue in mixed mode- serial and parallel will coexist and complement each other
Used 128 processor Intel hypercube and Denelcor HEP-I at Los Alamos in 1984
Alliant in 1986
Cray J-90 and Hitachi in mid 1990’s
Several parallel clusters, the new class of machines
A comparison:: A comparison: IEEE Computer Society president made a calculation that goes like this: If only automobile industry did what computer industry has done to computers we should be able to buy Mercedes Benz for a few hundred dollars
Theme of this talk is Performance: Theme of this talk is Performance The question is: of what?
Machine/Algorithm/Installation
Raw power of the machine: Megaflop rating
Multiprogramming was invented to increase
machine/installation performance
Analogy with dentist office, major airline hubs
Parallel processing is more like open heart surgery
or building a home
Algorithm performance: Algorithm performance Total amount of work to be done measured in terms of the number of operations
Interdependency among various parts of the algorithm
Fraction of the work that is intrinsically serial
This fraction is a determinant in deciding the overall reduction in parallel time
Our Interest is solving problems: Our Interest is solving problems Performance of machine – algorithm combination
Every algorithm can be implemented serially but not all algorithms admits parallelism
A best serial algorithm may be a bad parallel algorithm and a throw away serial algorithm may turn out to be a very good parallel algorithm
Slide13: Dynamic
Network M M M P P P Static
Network P P P A Classification of Parallel Architectures Distributed Memory
Explicit Communication by send/receive
Packet switching- Post-office
Intel hypercube, Clusters
Packet Shared Memory
Indirect communication by read/write in shared memory
Circuit switching-Telephone
CARY-J90 data s d
Slide14: An example of parallel performance analysis
A simple Task done by one or two persons Speed up = Time by one person/Time by two persons
= 10/6 = 1.67 <2
1< speedup < 2
Total work done is 12 man hours in parallel as opposed to 10 man hours serially
Reduction in elapsed time at increased resources/cost
Slide15: Problem P of size N
Parallel Processor with p processors
Algorithm: a serial algorithm As and a parallel algorithm Ap
Let Ts (N) be the serial time and Tp(N) be the parallel time
Speedup = Sp(N)
= Ratio of the elapsed time by the best known
serial algorithm/ Time taken by the chosen
parallel algorithm
= Ts(N)/Tp(N)
1<= Sp(N)<= p, the number of processors
In the above example speed is 1.67 and p=2
Slide16: Processor Efficiency = Ep(N)
=Speedup per processor
= Sp(N)/p
0 < Ep(N) <= 1
In the above example, efficiency = 1.67/2= 0.835 Actual work done p Tp(N) Actual work done <= pTp(N)
Idle processors
Slide17: Redundancy = Rp(N)
= Ratio of the work done by the parallel algorithm
to the work done by the best serial algorithm
= Work done in parallel/ Ts(N)
> = 1
In the above example, Rp(N) = 12/10 =1.2 Desirable attributes: For a given p, N must be large
Speedup must be as close to p
Efficiency as close to 1
Redundancy is close to 1.
Slide18: Sp(N) Rp(N) Ep(N) 1.0 p 1 1 0 A View of the 3 Dimensional Performance Space
Slide19: CASE Study I
Find the sum of N numbers x1,x2,…xN using p processors This is a complete binary tree with 3 levels
1:8 = 1:4 + 5:8 ; 1:4 = 1:2 +3:4; 5:8 =5:6+7:8
Number of operations performed decreases with time
N=8 p=4
Ts(N) = 7
Tp(N) = 3 =log 8 1:4=1+2+3+4
Slide20: Generalize: Problem size N, Processor size p = N/2
Ts(N)=N-1, Tp(N) = log N
Sp(N) = N-1/log N ~ N/log N - increases with N: OK
Ep(N) = 2/ log N - decreases with with N : Bad
Rp(N) = 1 - best value
Why this? - Too many processors & progressively idle
Goal: Increase Ep(N) by decreasing p
Strategy: Keep p large but fixed and increase N
Fix the processor and scale the problem : Think Big
Slide21:
Case Study II
N=2**n and p=2**k with k
Slide22: Serial Time Ts(N) = N
Parallel time Tp(N) = N/p + log p
Speed up = N/ (N/p + logp)
Let N = Lplogp. Since p is fixed, increase L to increase N
Speedup = Lplogp/[(L+1) logp]
= p [L/(L+1)]
= p the best possible when L is large
Efficiency = L/(L+1) ~ 1 the best possible when L is large
Redundancy = 1
The scaling does the trick – “Think big”
Slide23: Case Study III: Impact of communication: N node ring
Processor k has the number xk. Find the sum
1 2 3 4 7 6 5 8
Slide24: tc tc tc tc 2tc 2tc 4tc A tree version of the implementation in a ring N/p log p
Slide25: ta - the computation time per operation (Unit cost model)
tc - the communication time between neighbors
Ts(8) = 7ta and Tp(8) = 3ta + (1+2+4)tc
= 3ta + 7tc
Speedup = 7ta / [ 3ta + 7tc]
= 7 / [3 + 7r] where r = tc / ta
Ratio r depends on ta - processor technology
tc – network technology
In the early days, r ~200-300. Check the value r first.
Writing a parallel program is like writing music for an orchestra
Slide26: We now provide a summary of our findings:
# of procs = 2**k = p < N = 2**n = problem size
Interconnection scheme: p node ring
N ta
Speed up = -----------------------------------
[N/p + log p]ta + (p-1) tc
[N/p + log p] ta – depends on the parallel algorithm
(p-1)tc –depends on the match between the algorithm and the parallel architecture and on the technology of the implementation of the network.
Slide27: This expression for speedup will change if we change the
following:
If we change the allocation of tasks to processors, it will change
the communication pattern and hence the speedup
The network of interconnection is changed – hypercube, toroid
Etc
If we change the algorithm for the problem
Slide28: Ring of p
processors Logical structure of the
parallel algorithm
Host graph
Topology of the network of processors
Guest graph What is the import of these case studies? Do these match? N/p log p p
Slide29: The key question: Can we make every pair of processors
who have to communicate be neighbors at all time?
This is seldom possible. The new question is: can we minimize
the communication demands of an algorithms?
The answer lies at the match between the chosen parallel
algorithm and the topology of interconnection between
processors of the available architecture.
The network of processors that is most suitable for your problem
may not be available to you. This miss-match translates into
more communication overhead
Your parallel program may give correct results but it may still
a lot more time compared to when there is good match
Slide30: Compiler Vectorization Multi-vector
Shared memory
machines Dusty
deck CRAY
ALLIANT Two modes of programming: Compiler Distributed
memory machines Dusty
deck To date there is no general purpose parallelizing complier
for distributed memory machines
HPF provides a partial solution but not very efficient yet
Slide31: The import is: we have to do it all from ground up
PVM / MPI just provide the communication libraries
-- FORTRAN and C versions
Send, Receive, Broadcast, scatter, Gather etc.
These are to be embedded in the program at the right places
Thus, the program has to specify the following:
Who gets what part of the data set?
What type of computations to be done with this data?
who talks to whom at what time ?
How much they communicate?
Who has the final results?
Cover rudiments of MPI in the next seminar
Slide32: The CS cluster – (16 +1) node Star interconnection Master node is at the center
Mater handles the I/O p1 p2 p3 p4 p8 p7 p6 p5 M
Slide33: References
S. Lakshmivarahan and S. K. Dhall [1990] Analysis and
Design of Parallel Algorithms, McGraw Hill.
S.Lakshmivarahan and S. K. Dhall [1994] Parallel Processing
Using the Prefix Problem, Oxford University Press.
Course Announcement: Spring 2003
CS 5613 Computer Networks and Distributed Processing
T-Th 5.00-615pm