vadim suhomlinov improvement of multiline software

Presentation Transcript

Enhancing Quality of Multi-threaded software. Intel Threading Tools Use Cases: 

2007
Vadim Sukhomlinov, vadim.sukhomlinov@intel.com
Denys Kotlyarov, denys.kotlyarov@intel.com

Agenda: 

Why Multi-threading?
Multithreading and SW Lifecycle
Intel Threading Tools

Slide3: 

How to Double Performance and Not Burn Up?
Diagram labels: P0 ~ f², Core, Die/Socket, f
* Other brands and names may be claimed as the property of others.

Slide4: 

Performance Through Parallelism
Chart: performance relative to a 1.4 GHz Intel® Pentium® 4 Processor, measured as the average of SPECint2000 and SPECfp2000 rates
Time axis: 2000 to 2008+
Chart labels: 2004, 3X, Forecast
Source: Intel

Multicore is quickly becoming pervasive Single-threaded apps will be left behind: 

Projected run rate exiting the year, for the segments Desktop Performance, Server, Mobile Performance:
    2005: Multicore shipping, Multicore shipping, Multicore shipping
    2006: >70%, >70%, >85%
    2007: >90%, >90%, ~100%
Legend: Dual Core, Quad Core
All products and dates are preliminary and subject to change without notice.
* Source: IDC
Source: Intel

Slide6: 

Growing Availability of Multithreaded SW

Activision (Ravensoft), Adobe, Algorithmics, Alias, Autodesk, Business Objects, Cakewalk, CodecPeople, Computer Associates, Corel (WordPerfect), Cyberlink, Discreet, IBM, id Software, Landmark, Macromedia, Mainconcept, Maxon, mental images, Microsoft (Office Suite), Midway, MSC, Novell SUSE, Oracle, Pegasus, Pinnacle, Pixar (Renderman), Paradigm, PTC, SAP, SAS, Siebel CRM, Signet, Skype, SLB, SnapStream, Sonic (Roxio), Sony, Steinberg, SunGard, Sybase, Symantec, Thomson, THQ, Ubisoft, UGS, Valve, Yahoo (Musicmatch)

Multithreading as Competitive Advantage

What is Parallelism?: 

Two or more processes or threads execute at the same time
Parallelism for threading architectures
    Multiple processes
        Communication through Inter-Process Communication (IPC)
    Single process, multiple threads
        Communication through shared memory
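The "single process, multiple threads" case can be sketched in standard C++. This is a minimal illustration, not code from the presentation: the function name count_with_threads and its parameters are assumptions made for the example. Several threads communicate through one ordinary variable in shared memory, and a mutex serializes the updates.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Threads of one process share the same address space, so they can
// communicate through ordinary variables; the mutex serializes updates.
long count_with_threads(int num_threads, int increments_per_thread) {
    assert(num_threads > 0 && increments_per_thread >= 0);
    long shared_counter = 0;      // visible to every worker thread
    std::mutex counter_mutex;     // protects shared_counter

    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&] {
            for (int i = 0; i < increments_per_thread; ++i) {
                std::lock_guard<std::mutex> lock(counter_mutex);
                ++shared_counter;
            }
        });
    }
    for (std::thread& w : workers) w.join();   // wait for all workers
    return shared_counter;
}
```

Without the lock, the increments from different threads could interleave and some updates would be lost, which is the kind of data race listed among the risks of threading.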

Threads – Benefits & Risks: 

Benefits
    Competitive advantage for modern software
    Increased performance and better resource utilization
        Even on single-processor systems: for hiding latency and increasing throughput
    IPC through shared memory is more efficient
Risks
    Increases complexity of the application
    Difficult to debug and test (data races, deadlocks, etc.)

Common Question for SW Designers: 

Where to thread?
How long would it take to thread?
How much re-design/effort is required?
Is it worth threading a selected region?
What should the expected speedup be?
Will the performance meet expectations?
Will it scale as more threads/data are added?
Which threading model to use?

Threading is Complex

Threading Impact to Software Lifecycle: 

Requirements analysis and system specification
    Planning of properties, scalability
System and software design
    Complicated architecture development
Implementation and unit testing
    New development paradigm, uncommon bugs and issues
Integration, system verification and validation
    Quality assurance against planned properties: scalability. Testing obstacles.
Operation support and maintenance
    Analysis and reproduction of customer issues, workload-specific performance bottlenecks, scalability degradation
Disposal

Multithreading IS a design goal.
The cost of multithreading issues increases with each move to the next phase.

Create, Debug and Optimize Threaded Applications using Intel® Software Development Products: 

Introduce Threads / Design
    Leverage built-in threading support and highly optimized threaded libraries that enable performance gains even if an application isn't threaded!
Correctness / Debug
    Detect even latent programming challenges unique to parallel programming.
Optimize / Tune
    Tune for performance and scalability. Visualize threading issues to help focus threading optimization.
Analysis
    Analyze your application and identify multi-core performance bottlenecks and hotspots.

Intel has a broad toolset to help develop fast, reliable threaded applications

Sequential Development Cycle: 

Sequential Development Cycle

Planning Scalability - Amdahl Law: 

Amdahl's Law: upper bound of performance increase
Serial code limits scalability
Cases shown: n = 2, n = ∞
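The bound on this slide can be computed directly. The sketch below uses the standard statement of Amdahl's law, speedup = 1 / ((1 - p) + p / n), where p is the fraction of run time that can be parallelized and n is the number of processors. The function names are illustrative and do not come from the presentation.

```cpp
#include <cassert>
#include <limits>

// Amdahl's law: speedup on n processors when a fraction p of the run
// time can be parallelized (0 <= p <= 1) and the rest stays serial.
double amdahl_speedup(double p, double n) {
    assert(p >= 0.0 && p <= 1.0 && n >= 1.0);
    return 1.0 / ((1.0 - p) + p / n);
}

// Limit as n goes to infinity: only the serial fraction remains.
double amdahl_limit(double p) {
    assert(p >= 0.0 && p <= 1.0);
    if (p == 1.0) return std::numeric_limits<double>::infinity();
    return 1.0 / (1.0 - p);
}
```

For example, if half of the run time is parallel (p = 0.5), two processors give a speedup of about 1.33, and no number of processors can exceed 2.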

Sequential Development Cycle: 

Sequential Development Cycle

SW Design: Parallel Programming Models: 

Functional Decomposition
    Task parallelism
    Divide the computation, then associate the data
    Independent tasks of the same problem
Data Decomposition
    Same operation performed on different data
    Divide data into pieces, then associate computation
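Data decomposition can be sketched as follows, assuming a simple summation as the "same operation". The name parallel_sum and the chunking scheme are illustrative choices, not taken from the presentation. The data is divided into contiguous pieces first, and a thread is then associated with each piece.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Data decomposition: split the input into contiguous chunks and run the
// same operation (a sum) on each chunk in its own thread.
long long parallel_sum(const std::vector<int>& data, std::size_t num_chunks) {
    assert(num_chunks > 0);
    std::vector<long long> partial(num_chunks, 0);   // one slot per thread
    std::vector<std::thread> workers;
    const std::size_t chunk = (data.size() + num_chunks - 1) / num_chunks;

    for (std::size_t c = 0; c < num_chunks; ++c) {
        const std::size_t begin = std::min(c * chunk, data.size());
        const std::size_t end = std::min(begin + chunk, data.size());
        workers.emplace_back([&data, &partial, c, begin, end] {
            // Each thread writes only its own slot, so no lock is needed.
            partial[c] = std::accumulate(data.begin() + begin,
                                         data.begin() + end, 0LL);
        });
    }
    for (std::thread& w : workers) w.join();
    // Serial combine step over the per-chunk results.
    return std::accumulate(partial.begin(), partial.end(), 0LL);
}
```

Functional decomposition would instead start from distinct computations (for example, three independent function calls) and then decide which data each one needs.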

Implementation: OpenMP Standard: 

Fork-join parallelism:
    Master thread spawns a team of threads as needed
    Parallelism is added incrementally
    Sequential program evolves into a parallel program
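As a hedged illustration of the incremental style, the dot product below is an ordinary sequential loop plus one OpenMP directive. The function name dot is an assumption for the example. With OpenMP enabled, the master thread forks a team at the loop and joins it at the end; with OpenMP disabled, the pragma is ignored and the loop still runs serially.

```cpp
#include <cassert>
#include <vector>

// Dot product written as a sequential loop. The single pragma asks an
// OpenMP compiler to fork a team of threads for the loop and join them
// at its end; the reduction clause gives each thread a private partial
// sum and combines the partial sums when the team joins.
double dot(const std::vector<double>& a, const std::vector<double>& b) {
    assert(a.size() == b.size());
    double sum = 0.0;
    const int n = static_cast<int>(a.size());
    #pragma omp parallel for reduction(+ : sum)
    for (int i = 0; i < n; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}
```

The reduction clause matters: without it, all threads would update the same variable concurrently, which is a data race.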

Implementation: OpenMP Parallelization: 

Loop parallelism: each loop iteration is independent, so the order of execution does not matter.

    void test(int first, int last)
    {
        #pragma omp parallel for
        for (int i = first; i <= last; ++i) {
            a[i] = b[i] * c[i];
        }
    }

Section parallelism: the assignments to 'a', 'b', and 'c' are independent.

Sequential version:

    if (x < 0)
        a = foo(x);
    else
        a = x + 5;
    b = bat(y);
    c = baz(x + y);
    j = a + b + c;

OpenMP version:

    #pragma omp parallel sections
    {
        #pragma omp section
        if (x < 0)
            a = foo(x);
        else
            a = x + 5;
        #pragma omp section
        b = bat(y);
        #pragma omp section
        c = baz(x + y);
    }
    j = a + b + c;   /* runs after all sections have finished */

Sequential Development Cycle: 

Sequential Development Cycle

Implementation: OpenMP support in Visual C++: 

A specification for multithreaded programs
It consists of a set of simple #pragmas and runtime routines
    #pragma omp parallel
Where is the most value?
    Parallelizing large loops with no loop dependencies
Intel C++/Fortran implements the full OpenMP 2.5 standard with task extensions
Visual C++ 2005 implements the full OpenMP 2.0 standard
http://www.openmp.org

Intel® Threading Building Blocks Scalable Threads Faster: 

Scalable Threads Faster

Description
    Simplify threading for performance via a C++ template-based runtime library
Usage
    Implementation aid: easily introduce threading for utilizing multi-core platforms
    Performance aid: use common algorithms tuned for performance and scalability
    Quality aid: employ pre-packaged routines for common idioms and containers
    Design aid: focus on a higher level of abstraction via tasks and scalable patterns
Support
    Intel®, Microsoft* and GNU* compilers
    APIs: OpenMP*, Windows* threads, POSIX* threads
    Special tools support: Intel® Thread Checker and Intel® Thread Profiler
Platforms

Less code to achieve parallelism Example: 2D Ray Tracing Application: 

Windows Threads

Thread Setup and Initialization

    CRITICAL_SECTION MyMutex, MyMutex2, MyMutex3;

    int get_num_cpus (void)
    {
        SYSTEM_INFO si;
        GetSystemInfo (&si);
        return (int) si.dwNumberOfProcessors;
    }

    int nthreads = get_num_cpus ();
    HANDLE *threads = (HANDLE *) alloca (nthreads * sizeof (HANDLE));
    InitializeCriticalSection (&MyMutex);
    InitializeCriticalSection (&MyMutex2);
    InitializeCriticalSection (&MyMutex3);
    for (int i = 0; i < nthreads; i++) {
        DWORD id;
        /* CreateThread returns the handle that is stored in the array */
        threads[i] = CreateThread (NULL, 0, parallel_thread, (LPVOID)(INT_PTR) i, 0, &id);
    }
    for (int i = 0; i < nthreads; i++) {
        /* wait on the handle itself, not on its address */
        WaitForSingleObject (threads[i], INFINITE);
    }

Parallel Task Scheduling and Execution

    const int MINPATCH = 150;
    const int DIVFACTOR = 2;

    typedef struct work_queue_entry_s {
        patch pch;
        struct work_queue_entry_s *next;
    } work_queue_entry_t;

    work_queue_entry_t *work_queue_head = NULL;
    work_queue_entry_t *work_queue_tail = NULL;

    /* Recursively split a patch until it is small enough, then queue it */
    void generate_work (patch* pchin)
    {
        int startx, stopx, starty, stopy;
        int xs, ys;
        startx = pchin->startx;
        stopx = pchin->stopx;
        starty = pchin->starty;
        stopy = pchin->stopy;
        if (((stopx-startx) >= MINPATCH) || ((stopy-starty) >= MINPATCH)) {
            int xpatchsize = (stopx-startx)/DIVFACTOR + 1;
            int ypatchsize = (stopy-starty)/DIVFACTOR + 1;
            for (ys = starty; ys <= stopy; ys += ypatchsize)
                for (xs = startx; xs <= stopx; xs += xpatchsize) {
                    patch pch;
                    pch.startx = xs;
                    pch.starty = ys;
                    pch.stopx = MIN(xs+xpatchsize-1, stopx);
                    pch.stopy = MIN(ys+ypatchsize-1, stopy);
                    generate_work (&pch);
                }
        } else {
            /* just trace this patch */
            work_queue_entry_t *q = (work_queue_entry_t *) malloc (sizeof (work_queue_entry_t));
            q->pch.starty = starty;
            q->pch.stopy = stopy;
            q->pch.startx = startx;
            q->pch.stopx = stopx;
            q->next = NULL;
            if (work_queue_head == NULL) {
                work_queue_head = q;
            } else {
                work_queue_tail->next = q;
            }
            work_queue_tail = q;
        }
    }

    void generate_worklist (void)
    {
        patch pch;
        pch.startx = startx;
        pch.stopx = stopx;
        pch.starty = starty;
        pch.stopy = stopy;
        generate_work (&pch);
    }

    /* Pop the next patch from the shared queue; false when it is empty */
    bool schedule_thread_work (patch &pch)
    {
        EnterCriticalSection (&MyMutex3);
        work_queue_entry_t *q = work_queue_head;
        if (q != NULL) {
            pch = q->pch;
            work_queue_head = work_queue_head->next;
        }
        LeaveCriticalSection (&MyMutex3);
        return (q != NULL);
    }

    generate_worklist ();

    /* Thread entry point, with the signature that CreateThread expects */
    DWORD WINAPI parallel_thread (LPVOID arg)
    {
        patch pch;
        while (schedule_thread_work (pch)) {
            for (int y = pch.starty; y <= pch.stopy; y++) {
                for (int x = pch.startx; x <= pch.stopx; x++) {
                    render_one_pixel (x, y);
                }
            }
            if (scene.displaymode == RT_DISPLAY_ENABLED) {
                EnterCriticalSection (&MyMutex3);
                for (int y = pch.starty; y <= pch.stopy; y++) {
                    GraphicsDrawRow (pch.startx-1, y-1, pch.stopx-pch.startx+1,
                        (unsigned char *) &global_buffer[((y-starty)*totalx+(pch.startx-startx))*3]);
                }
                LeaveCriticalSection (&MyMutex3);
            }
        }
        return 0;
    }

Intel® Threading Building Blocks

Thread Setup and Initialization

    #include "tbb/task_scheduler_init.h"
    #include "tbb/spin_mutex.h"

    tbb::task_scheduler_init init;
    tbb::spin_mutex MyMutex, MyMutex2;

Parallel Task Scheduling and Execution

    #include "tbb/parallel_for.h"
    #include "tbb/blocked_range2d.h"

    class parallel_task {
    public:
        /* Render one 2D block of the image handed out by the scheduler */
        void operator() (const tbb::blocked_range2d<int> &r) const {
            for (int y = r.rows().begin(); y != r.rows().end(); ++y) {
                for (int x = r.cols().begin(); x != r.cols().end(); x++) {
                    render_one_pixel (x, y);
                }
            }
            if (scene.displaymode == RT_DISPLAY_ENABLED) {
                tbb::spin_mutex::scoped_lock lock (MyMutex2);
                for (int y = r.rows().begin(); y != r.rows().end(); ++y) {
                    GraphicsDrawRow (startx-1, y-1, totalx,
                        (unsigned char *) &global_buffer[(y-starty)*totalx*3]);
                }
            }
        }
        parallel_task () {}
    };

    tbb::parallel_for (tbb::blocked_range2d<int> (starty, stopy + 1, grain_size,
                                                  startx, stopx + 1, grain_size),
                       parallel_task ());

This example includes software developed by John E. Stone.

Focus on the work to do, not on how to manage threads (thread control)
Intel® TBB offers cleaner design, competitive performance and platform portability

Sequential Development Cycle: 

Sequential Development Cycle

Intel® Thread Checker 3.0 for Windows* and Linux* Create Threads Faster: 

Create Threads Faster

Detects challenging data races and deadlocks
Pinpoints errors to the source code line
Works on standard debug builds without recompiling
Supports 32-bit and 64-bit applications
Batch script integration for regression test runs
Recommends modules to instrument by usage, to minimize instrumentation overhead
Windows*
    Supports Microsoft Visual Studio 2005*
Linux*
    Introduces native Linux* support through command-line views

Debugging for Correctness: 

Intel® Thread Checker pinpoints notorious threading bugs like data races, stalls and deadlocks

Diagram: Intel® Thread Checker inside the VTune™ Performance Analyzer
    Binary instrumentation: Primes.exe -> Primes.exe (instrumented) + DLLs (instrumented)
    Runtime Data Collector -> threadchecker.thr (result file)
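One classic deadlock of the kind such a tool reports is two threads taking the same pair of locks in opposite order. The sketch below is illustrative only: the Account type and function names are hypothetical and not from the presentation. It shows one common fix in standard C++, acquiring both mutexes together with std::lock.

```cpp
#include <cassert>
#include <mutex>
#include <thread>

struct Account {
    std::mutex m;
    long balance = 0;
};

// Locking from.m and then to.m in argument order would let two opposite
// transfers each hold one mutex and wait forever for the other.
// std::lock acquires both mutexes with a deadlock-avoidance algorithm.
void transfer(Account& from, Account& to, long amount) {
    assert(&from != &to);
    std::lock(from.m, to.m);
    std::lock_guard<std::mutex> hold_from(from.m, std::adopt_lock);
    std::lock_guard<std::mutex> hold_to(to.m, std::adopt_lock);
    from.balance -= amount;
    to.balance += amount;
}

// Runs opposite transfers concurrently and returns the combined balance.
long run_opposite_transfers(int rounds) {
    Account a, b;
    a.balance = 1000;
    b.balance = 1000;
    std::thread t1([&] { for (int i = 0; i < rounds; ++i) transfer(a, b, 1); });
    std::thread t2([&] { for (int i = 0; i < rounds; ++i) transfer(b, a, 2); });
    t1.join();
    t2.join();
    return a.balance + b.balance;
}
```

The combined balance stays constant because every transfer updates both accounts while holding both locks.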

Slide25: 

25 PINPOINTS SOURCE CODE

Sequential Development Cycle: 

Sequential Development Cycle

Common Performance Issues: 

Parallel Overhead
    Due to thread creation, scheduling, ...
Synchronization
    Excessive use of global data, contention for the same synchronization object
Load Imbalance
    Improper distribution of parallel work
Granularity
    Insufficient parallel work
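Load imbalance is often addressed by handing out work dynamically instead of in fixed blocks. The following is a minimal sketch under that assumption. The names process_dynamic and triangular are hypothetical, and the workload is invented to make item costs uneven.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Dynamic work distribution: each thread repeatedly claims the next
// unprocessed index from a shared atomic counter, so a thread that
// draws cheap items simply takes more of them.
std::vector<int> process_dynamic(int num_items, int num_threads,
                                 int (*work)(int)) {
    assert(num_items >= 0 && num_threads > 0);
    std::vector<int> results(num_items);
    std::atomic<int> next(0);

    auto worker = [&] {
        for (;;) {
            const int i = next.fetch_add(1);   // claim one item
            if (i >= num_items) break;
            results[i] = work(i);              // distinct slot per item
        }
    };
    std::vector<std::thread> pool;
    for (int t = 0; t < num_threads; ++t) pool.emplace_back(worker);
    for (std::thread& th : pool) th.join();
    return results;
}

// Hypothetical uneven workload: cost grows with the item index.
int triangular(int i) {
    int sum = 0;
    for (int k = 1; k <= i; ++k) sum += k;
    return sum;
}
```

Claiming one item at a time gives the best balance but the most synchronization traffic. Claiming several items per fetch trades balance for lower overhead, which is the granularity concern on this slide.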

Intel® Thread Profiler 3.0 for Windows* Optimize Threads Faster: 

Optimize Threads Faster

Key Benefits
    Shows how much of your application is not optimally parallel, and where
    Identifies where thread-specific overhead impacts performance
    Highlights thread workload imbalances and thread activity
    Shows the number of cores utilized
    Pinpoints issues to the source code line
    Maximizes application time spent in parallel regions
    Supports 32-bit and 64-bit applications
    Supports Microsoft Visual Studio 2005*

Tuning for Performance: 

Thread Profiler pinpoints performance bottlenecks in threaded applications

Diagram: two instrumentation paths feed the Runtime Data Collector
    Binary instrumentation: Primes.exe -> Primes.exe (instrumented) + DLLs (instrumented)
    Compiler source instrumentation: Primes.c -> Primes.exe, built with /Qopenmp_profile
    Result files: Bistro.tp / guide.gvs

Intel® Thread Profiler: critical path analysis: 

Each duration on the critical path points to the single thread that limits program performance
Time spent in transitions between threads is the overhead of switching and synchronizing threads
Decreasing the duration of execution segments that lie on the critical path improves application performance
Analysis of the efficiency of system resource utilization

Evolutionary Development : 

Develop a system gradually in many repetitive stages:
    Increasing the knowledge of the system requirements and system functionality in each stage
    Exposing the results to user comments
This can be achieved by using:
    The Iterative Model
    The Incremental Model
    The Prototyping Model
Now we can get feedback from the previous stage: Iterative Implementation

Iterative Model: 

Iterative Model

Incremental Model: 

Incremental Model

Prototyping Model: 

Prototyping Model

Spiral Model: 

Spiral Model

Summary: 

Multithreading IS a competitive advantage
Multithreading IS complex
Multithreading impacts ALL phases of the SW lifecycle
Intel delivers several software developer products designed to make multithreading easier and faster:
    Intel Thread Checker
    Intel Thread Profiler
    Intel Threading Building Blocks
    VTune Performance Analyzer
Try the Intel software developer tools today!
