logging in or signing up 08 Tornado Gulkund Download Post to : URL : Related Presentations : Share Add to Flag Embed Email Send to Blogs and Networks Add to Channel Uploaded from authorPOINTLite Insert YouTube videos in PowerPont slides with aS Desktop Copy embed code: Embed: Flash iPad Dynamic Copy Does not support media & animations Automatically changes to Flash or non-Flash embed WordPress Embed Customize Embed URL: Copy Thumbnail: Copy The presentation is successfully added In Your Favorites. Views: 114 Category: News & Reports.. License: All Rights Reserved Like it (0) Dislike it (0) Added: October 07, 2007 This Presentation is Public Favorites: 0 Presentation Description No description available. Comments Posting comment... Premium member Presentation Transcript Tornado: Minimizing Locality and Concurrency in a SMP OS: Tornado: Minimizing Locality and Concurrency in a SMP OS Slide2: http://www.eecg.toronto.edu/~okrieg/tmp.pdfWhy locality matters:: Why locality matters: Faster processors and more complex controllers -> higher memory latencies Write sharing costs Large secondary caches Large cache lines -> false sharing NUMA effects … Goal: Goal Minimize read/write and write sharing; -> minimize cache coherence overheads Minimize false sharing Minimize distance between accessing processor and target memory moduleDo real systems do this?: Do real systems do this? Yes and no Tornado -> adopt design principles to maximize locality and concurrency Map locality and independency which exists in the OS requests from applications into locality and independence in servicing these requests in the kernel or system servers Approach – re-think who data structures are organized and how operations on them are applied Counter ilustration: Counter ilustration Shared counter, array counter, padded counter Tornado basics: Tornado basics Individual resources in individual objects Mechanisms: Clustering objects Protected procedure calls Semi-automatic garbage collection / efficient lockingClustered objects: Clustered objects Appear as a single object Multiple “reps” assigned to handle object references from one (or more) processors Object = granularity of access Operations, synchronization can be applied only to relevant pieces Will make global policies more difficult (e.g., global paging policy) Implementation should reflect object useCluster Objects Implementation: Cluster Objects Implementation Mix of replication and partitioning techniques: Process Obj replicated, Regions distributed and created on demand… Combination of object migration, home rep, and other techniques (think distributed shared memory…) Translation tables to handle implementation Per processor to access local reps Global partitioned table across processors to find rep for given object Default “miss” handler May be quite large, but sparse -> let caching mechanisms help keep around only relevant pieces… Dynamic Memory Allocation: Dynamic Memory Allocation Local allocation – per “node” For small, less than cache-line data, use separate pool Addresses false sharing issue Avoid interrupt disabling by using efficient locks Protected procedure calls: Protected procedure calls Jumps into address space of a (server) object Microkernel design Client requests serviced on local processors (translation table) Handoff scheduling # server threads == # client threads Stub generator to generate code based on public interface Reference checking Special MetaPort to handle first use of a PPC Parameter passing Mix of registers, mapped stack or memory regions Cross-processor IPC Optimize so that caller spins in trapSynchronization: Synchronization They separate locking (for updates) & existence guarantees (deallocations) Encapsulate lock within object (better rep), avoid global locks Avoids contention, limits cache coherence operations on lock access Use spin-then-block locks Garbage Collection: Garbage Collection Essentially RCU Must ensure all persistent and temporary object references are removed Object/rep keeps track of requests made out to it – counter decremented on completion – so when counter is zero no temp references Since first use of object goes through translation table, can determine which processors have object reps, and can use a token scheme to ensure object ref counter is zero for each processor Finally – safe to dealloc objectEvaluation: Evaluation Use of NUMAchine and simulator NUMAchine – ring of 4 stations, each with 4 processors and a memory module, direct mapped caches Simulator different interconnect and cache coherence protocol First validate simulator is OK then use simulator to gather other data: First validate simulator is OK then use simulator to gather other data Effects of cluster objects: Effects of cluster objects Page faults frequent, region deletions aren’tSlide17: NUMAchine, SimOS and SimOS w/ 4-way assoc cacheCompared to other arch/OS, MT and MP mode: Compared to other arch/OS, MT and MP mode MT MP pagefault fstat thread You do not have the permission to view this presentation. In order to view it, please contact the author of the presentation.
08 Tornado Gulkund Download Post to : URL : Related Presentations : Share Add to Flag Embed Email Send to Blogs and Networks Add to Channel Uploaded from authorPOINTLite Insert YouTube videos in PowerPont slides with aS Desktop Copy embed code: Embed: Flash iPad Dynamic Copy Does not support media & animations Automatically changes to Flash or non-Flash embed WordPress Embed Customize Embed URL: Copy Thumbnail: Copy The presentation is successfully added In Your Favorites. Views: 114 Category: News & Reports.. License: All Rights Reserved Like it (0) Dislike it (0) Added: October 07, 2007 This Presentation is Public Favorites: 0 Presentation Description No description available. Comments Posting comment... Premium member Presentation Transcript Tornado: Minimizing Locality and Concurrency in a SMP OS: Tornado: Minimizing Locality and Concurrency in a SMP OS Slide2: http://www.eecg.toronto.edu/~okrieg/tmp.pdfWhy locality matters:: Why locality matters: Faster processors and more complex controllers -> higher memory latencies Write sharing costs Large secondary caches Large cache lines -> false sharing NUMA effects … Goal: Goal Minimize read/write and write sharing; -> minimize cache coherence overheads Minimize false sharing Minimize distance between accessing processor and target memory moduleDo real systems do this?: Do real systems do this? Yes and no Tornado -> adopt design principles to maximize locality and concurrency Map locality and independency which exists in the OS requests from applications into locality and independence in servicing these requests in the kernel or system servers Approach – re-think who data structures are organized and how operations on them are applied Counter ilustration: Counter ilustration Shared counter, array counter, padded counter Tornado basics: Tornado basics Individual resources in individual objects Mechanisms: Clustering objects Protected procedure calls Semi-automatic garbage collection / efficient lockingClustered objects: Clustered objects Appear as a single object Multiple “reps” assigned to handle object references from one (or more) processors Object = granularity of access Operations, synchronization can be applied only to relevant pieces Will make global policies more difficult (e.g., global paging policy) Implementation should reflect object useCluster Objects Implementation: Cluster Objects Implementation Mix of replication and partitioning techniques: Process Obj replicated, Regions distributed and created on demand… Combination of object migration, home rep, and other techniques (think distributed shared memory…) Translation tables to handle implementation Per processor to access local reps Global partitioned table across processors to find rep for given object Default “miss” handler May be quite large, but sparse -> let caching mechanisms help keep around only relevant pieces… Dynamic Memory Allocation: Dynamic Memory Allocation Local allocation – per “node” For small, less than cache-line data, use separate pool Addresses false sharing issue Avoid interrupt disabling by using efficient locks Protected procedure calls: Protected procedure calls Jumps into address space of a (server) object Microkernel design Client requests serviced on local processors (translation table) Handoff scheduling # server threads == # client threads Stub generator to generate code based on public interface Reference checking Special MetaPort to handle first use of a PPC Parameter passing Mix of registers, mapped stack or memory regions Cross-processor IPC Optimize so that caller spins in trapSynchronization: Synchronization They separate locking (for updates) & existence guarantees (deallocations) Encapsulate lock within object (better rep), avoid global locks Avoids contention, limits cache coherence operations on lock access Use spin-then-block locks Garbage Collection: Garbage Collection Essentially RCU Must ensure all persistent and temporary object references are removed Object/rep keeps track of requests made out to it – counter decremented on completion – so when counter is zero no temp references Since first use of object goes through translation table, can determine which processors have object reps, and can use a token scheme to ensure object ref counter is zero for each processor Finally – safe to dealloc objectEvaluation: Evaluation Use of NUMAchine and simulator NUMAchine – ring of 4 stations, each with 4 processors and a memory module, direct mapped caches Simulator different interconnect and cache coherence protocol First validate simulator is OK then use simulator to gather other data: First validate simulator is OK then use simulator to gather other data Effects of cluster objects: Effects of cluster objects Page faults frequent, region deletions aren’tSlide17: NUMAchine, SimOS and SimOS w/ 4-way assoc cacheCompared to other arch/OS, MT and MP mode: Compared to other arch/OS, MT and MP mode MT MP pagefault fstat thread