logging in or signing up CDP06 stoodley Pravez Download Post to : URL : Related Presentations : Share Add to Flag Embed Email Send to Blogs and Networks Add to Channel Uploaded from authorPOINT Insert YouTube videos in PowerPont slides with aS Desktop Copy embed code: (To copy code, click on the text box) Embed: URL: Thumbnail: WordPress Embed Customize Embed The presentation is successfully added In Your Favorites. Views: 37 Category: Entertainment License: All Rights Reserved Like it (0) Dislike it (0) Added: September 19, 2007 This Presentation is Public Favorites: 0 Presentation Description No description available. Comments Posting comment... Premium member Presentation Transcript Avoiding Live Lock when Patching Codein Real-Time Execution Environments: Avoiding Live Lock when Patching Code in Real-Time Execution Environments Mark Stoodley Real-Time Java Compiler Development IBM Toronto Lab Outline: Outline Code patching background The live lock problem Two ways to avoid live lock Two examples Resolving a static field reference Updating target of virtual invocation cache Code patching: Code patching JIT compilers generate code designed to be modified during execution Resolution for classes, fields, methods Fill in virtual/interface invocation caches Lazy call target update after (re)compilation Fixups for virtual guards Typically performed by application threads via runtime helper Another thread may execute during modification Code patching is Hard: Code patching is Hard Multiple threads executing while patching Processors not designed to support it well Undocumented coherence requirements/loopholes Not designed to be fast Prevent execution of inconsistent instructions Strongly influenced by instruction set Atomic writes: how much can you change at once Code patching is Hard : Code patching is Hard Goal is quality code after patching Interacts with lots of other complex things Exception handling, stack walking Class loading and resolution rules Implementation induced complexities Result is usually a complex dance Careful design and layout of generated code Careful orchestration of steps Example: Static field resolution (Intel x86): Example: Static field resolution (Intel x86) inc dword ptr[0h] Example: Static field resolution (Intel x86): Example: Static field resolution (Intel x86) call Lsnippet db 00h Lsnippet: push 024h ; cp index push 08564ach ; const pool call unresolvedStaticGlue db 0ff05h Make sure first execution goes to snippet: generate 5B call instead of 6B inc Generate 5B call, but make space for 6B inc Example: Static field resolution (Intel x86): Example: Static field resolution (Intel x86) call Lsnippet db 00h Lsnippet: push 024h ; cp index push 08564ach ; const pool call unresolvedStaticGlue db 0ff05h After resolving static address, glue prepares to patch 6B instruction Resolves field to 088aa5ach Need to patch 6 bytes atomically: Three step process Example: Static field resolution (Intel x86): Example: Static field resolution (Intel x86) jmp -2 db 00000002h Lsnippet: push 024h ; cpIndex push 08564ach ; const pool call unresolvedStaticGlue db 0ff05h Step 1: protect against multiple threads by patching self-loop 2-byte self loop (JMP -2) 2 bytes cannot cross patching boundary (8B on AMD64) After patching fence, these 4 bytes can now be patched with static address Patching fence = mfence, clflush, mfence Resolves field to 088aa5ach Example: Static field resolution (Intel x86): Lsnippet: push 024h ; cpIndex push 08564ach ; const pool call unresolvedStaticGlue db 0ff05h Example: Static field resolution (Intel x86) jmp -2 db 088aa5ach Step 2: write resolved static address in 4-byte field protected by self-loop 2-byte self loop (JMP -2) Write resolved static address (nonatomic) Then, patching fence to ensure all threads see address before self loop is removed Resolves field to 088aa5ach Example: Static field resolution (Intel x86): Lsnippet: push 024h ; cpIndex push 08564ach ; const pool call unresolvedStaticGlue db 0ff05h Example: Static field resolution (Intel x86) inc dword ptr[088aa5ah] Step 3: remove the self-loop and restore the original instruction bytes Benefits: thread-safe final code quality good BUT: uses self-loop Busy-wait loops can be BAD: Busy-wait loops can be BAD Employed for safety in code patching FIFO scheduling in Real-Time OS can result in live lock T1 (priority 10) T2 (priority 20) Resolve field ref Patch self-loop over instr Preempted by T2 T2 wakes up Tries to execute same field ref T2 stuck in self-loop T1 can never remove self-loop: live lock Busy-wait-free code patching: no live lock: Busy-wait-free code patching: no live lock Two basic approaches All threads do idempotent patch: let them all do it Cache line ping-pong effect may be slow(er) but correct Only one thread must patch: construct backup path Direct threads that arrive while patching to backup path Slower but correct execution Sometimes lowers resulting code quality Example: Static field resolution, no livelock: Example: Static field resolution, no livelock inc dword ptr[0h] Lsnippet: push 024h ; cp index push 08564ach ; const pool call unresolvedStaticGlue Example: Static field resolution, no livelock: Example: Static field resolution, no livelock inc dword ptr[0h] Lsnippet: push 024h ; cp index push 08564ach ; const pool call unresolvedStaticGlue call Lsnippet 5B call generated explicitly ahead of the instruction to be resolved Example: Static field resolution, no livelock: Example: Static field resolution, no livelock inc dword ptr[088aa5ach] Lsnippet: push 024h ; cp index push 08564ach ; const pool call unresolvedStaticGlue call Lsnippet After resolving static address, glue patches the memory ref instruction Resolves field to 088aa5ach Note: any threads that reach the glue ALL patch the memory ref instruction BUT all threads will patch same value, so no races Example: Static field resolution, no livelock: Example: Static field resolution, no livelock inc dword ptr[088aa5ach] Lsnippet: push 024h ; cp index push 08564ach ; const pool call unresolvedStaticGlue call Lsnippet Now need to get rid of call to snippet, since ref has been resolved Patch a 5-byte NOP over the call: lea eax, ds:[eax] BUT: can’t do it atomically in one shot, need 3 steps again NOTE that any thread can now safely execute the memory reference instruction because it’s been patched Resolves field to 088aa5ach Example: Static field resolution, no livelock: Example: Static field resolution, no livelock inc dword ptr[088aa5ach] Lsnippet: push 024h ; cp index push 08564ach ; const pool call unresolvedStaticGlue jmp +3 db 000002h Step 1: patch short jump JMP +3 to memory ref instruction (lock cmpxchg) Resolves field to 088aa5ach Example: Static field resolution, no livelock: Example: Static field resolution, no livelock inc dword ptr[088aa5ach] Lsnippet: push 024h ; cp index push 08564ach ; const pool call unresolvedStaticGlue jmp +3 db 002044h Step 2: patch last three bytes of 5-byte NOP instruction Resolves field to 088aa5ach Example: Static field resolution, no livelock: Example: Static field resolution, no livelock inc dword ptr[088aa5ach] Lsnippet: push 024h ; cpIndex push 08564ach ; const pool call unresolvedStaticGlue lea eax, ds:[eax] Step 3: patch first 2 bytes of 5 byte NOP over the JMP +3 Benefits: thread-safe no live lock because no busy-waits BUT: 5-byte NOP residue hot code size increase Resolves field to 088aa5ach Example 2: Virtual invocation cache: Example 2: Virtual invocation cache Virtual invocation o.foo() Target method depends on class of receiver object o Full virtual dispatch uses lookup in o’s class virtual function table Expensive: indirection from object’s class For performance, use virtual invocation cache if (receiver class is C) call C.foo(); else call o.foo(); Example 2: Virtual invocation cache: Example 2: Virtual invocation cache cmp ebx, andlt;CLASS Candgt; jne FullDispatchSnip call andlt;C.foo() entryandgt; Continue: FullDispatchSnip: mov ecx, [ebx-andlt;VFT slotandgt;] call ecx jmp Continue 0xCCCCCCCC, 0x0f85, and 0xFDFDFDFD must be patched atomically to initialize cache Example 2: Virtual invocation cache: Example 2: Virtual invocation cache cmp ebx, andlt;CLASS Candgt; jne FullDispatchSnip call andlt;c.foo() entryandgt; Continue: FullDispatchSnip: mov ecx, [ebx-andlt;VFT slotandgt;] call ecx jmp Continue If target Ti not compiled when cache initialized, then patch new target Tc over Ti after C.foo() is compiled (actually next time called) Example 2: Virtual invocation cache: Example 2: Virtual invocation cache cmp ebx, andlt;CLASS Candgt; jne FullDispatchSnip call andlt;c.foo() entryandgt; Continue: FullDispatchSnip: mov ecx, [ebx-andlt;VFT slotandgt;] call ecx jmp Continue This cache cannot be placed so that none of these fields cross 8B patching boundary Example 2: Virtual invocation cache: Example 2: Virtual invocation cache If target not compiled yet, target written into cache is address of glue function Glue function looks at target: compiled yet? If not compiled, transition to interpreter If compiled, patch compiled target into cache Problem: can’t write entire target atomically Can atomically write first 2 bytes of call instruction Fancy footwork to avoid writing full target atomically Example 2: Virtual invocation cache: Example 2: Virtual invocation cache cmp ebx, andlt;CLASS Candgt; jne FullDispatchSnip call andlt;c.foo() entryandgt; Continue: FullDispatchSnip: mov ecx, [ebx-andlt;VFT slotandgt;] call ecx jmp Continue Patching boundary can fall before 0f or after 85: same as if call didn’t need patching Example 2: Virtual invocation cache: Example 2: Virtual invocation cache cmp ebx, andlt;CLASS Candgt; jne FullDispatchSnip call andlt;c.foo() entryandgt; Continue: FullDispatchSnip: mov ecx, [ebx-andlt;VFT slotandgt;] call ecx jmp Continue Patching has several steps so cannot allow multiple threads to proceed: establish backup path to full dispatch Example 2: Virtual invocation cache: Example 2: Virtual invocation cache cmp ebx, 0ffffffffh jne FullDispatchSnip call andlt;c.foo() entryandgt; Continue: FullDispatchSnip: mov ecx, [ebx-andlt;VFT slotandgt;] call ecx jmp Continue First, clear out class pointer: effectively converts ‘jne’ into ‘jmp’ (atomic compare and exchange) Patching Fence Example 2: Virtual invocation cache: Example 2: Virtual invocation cache cmp ebx, 0ffffffffh jne FullDispatchSnip jmp -14 db TgTgTg Continue: FullDispatchSnip: mov ecx, [ebx-andlt;VFT slotandgt;] call ecx jmp Continue Next, protect last 3 bytes of call instruction with JMP -14 (back to compare instruction) Patching Fence Example 2: Virtual invocation cache: Example 2: Virtual invocation cache cmp ebx, 0ffffffffh jne FullDispatchSnip jmp -14 db TcTcTc Continue: FullDispatchSnip: mov ecx, [ebx-andlt;VFT slotandgt;] call ecx jmp Continue Now we can patch the last three bytes of the call with the new target Tc Patching Fence Example 2: Virtual invocation cache: Example 2: Virtual invocation cache cmp ebx, 0ffffffffh jne FullDispatchSnip call TcTcTcTc Continue: FullDispatchSnip: mov ecx, [ebx-andlt;VFT slotandgt;] call ecx jmp Continue Remove the JMP -14 by putting the call instruction back Example 2: Virtual invocation cache: Example 2: Virtual invocation cache cmp ebx, andlt;CLASS Candgt; jne FullDispatchSnip call TcTcTcTc Continue: FullDispatchSnip: mov ecx, [ebx-andlt;VFT slotandgt;] call ecx jmp Continue Finally, put the true class pointer back into the compare instruction Summary: Summary Modern JITs generate code that can patch itself via runtime helpers Helpers are complex,hand-written assembler Interactions with class loading, stack walking Busy-wait loops employed to prevent thread races Real-Time operating systems use FIFO scheduling Busy-wait loops can result in live lock Summary: Summary Avoid live lock with two techniques: If same value to be written, let all threads write it If only one thread can write, establish backup path first for all threads but one to use Two examples Unresolved static field reference Updating virtual invocation cache target when it has been (re)compiled Questions?: Questions? Mark Stoodley IBM Toronto Lab mstoodle@ca.ibm.com You do not have the permission to view this presentation. In order to view it, please contact the author of the presentation.
CDP06 stoodley Pravez Download Post to : URL : Related Presentations : Share Add to Flag Embed Email Send to Blogs and Networks Add to Channel Uploaded from authorPOINT Insert YouTube videos in PowerPont slides with aS Desktop Copy embed code: (To copy code, click on the text box) Embed: URL: Thumbnail: WordPress Embed Customize Embed The presentation is successfully added In Your Favorites. Views: 37 Category: Entertainment License: All Rights Reserved Like it (0) Dislike it (0) Added: September 19, 2007 This Presentation is Public Favorites: 0 Presentation Description No description available. Comments Posting comment... Premium member Presentation Transcript Avoiding Live Lock when Patching Codein Real-Time Execution Environments: Avoiding Live Lock when Patching Code in Real-Time Execution Environments Mark Stoodley Real-Time Java Compiler Development IBM Toronto Lab Outline: Outline Code patching background The live lock problem Two ways to avoid live lock Two examples Resolving a static field reference Updating target of virtual invocation cache Code patching: Code patching JIT compilers generate code designed to be modified during execution Resolution for classes, fields, methods Fill in virtual/interface invocation caches Lazy call target update after (re)compilation Fixups for virtual guards Typically performed by application threads via runtime helper Another thread may execute during modification Code patching is Hard: Code patching is Hard Multiple threads executing while patching Processors not designed to support it well Undocumented coherence requirements/loopholes Not designed to be fast Prevent execution of inconsistent instructions Strongly influenced by instruction set Atomic writes: how much can you change at once Code patching is Hard : Code patching is Hard Goal is quality code after patching Interacts with lots of other complex things Exception handling, stack walking Class loading and resolution rules Implementation induced complexities Result is usually a complex dance Careful design and layout of generated code Careful orchestration of steps Example: Static field resolution (Intel x86): Example: Static field resolution (Intel x86) inc dword ptr[0h] Example: Static field resolution (Intel x86): Example: Static field resolution (Intel x86) call Lsnippet db 00h Lsnippet: push 024h ; cp index push 08564ach ; const pool call unresolvedStaticGlue db 0ff05h Make sure first execution goes to snippet: generate 5B call instead of 6B inc Generate 5B call, but make space for 6B inc Example: Static field resolution (Intel x86): Example: Static field resolution (Intel x86) call Lsnippet db 00h Lsnippet: push 024h ; cp index push 08564ach ; const pool call unresolvedStaticGlue db 0ff05h After resolving static address, glue prepares to patch 6B instruction Resolves field to 088aa5ach Need to patch 6 bytes atomically: Three step process Example: Static field resolution (Intel x86): Example: Static field resolution (Intel x86) jmp -2 db 00000002h Lsnippet: push 024h ; cpIndex push 08564ach ; const pool call unresolvedStaticGlue db 0ff05h Step 1: protect against multiple threads by patching self-loop 2-byte self loop (JMP -2) 2 bytes cannot cross patching boundary (8B on AMD64) After patching fence, these 4 bytes can now be patched with static address Patching fence = mfence, clflush, mfence Resolves field to 088aa5ach Example: Static field resolution (Intel x86): Lsnippet: push 024h ; cpIndex push 08564ach ; const pool call unresolvedStaticGlue db 0ff05h Example: Static field resolution (Intel x86) jmp -2 db 088aa5ach Step 2: write resolved static address in 4-byte field protected by self-loop 2-byte self loop (JMP -2) Write resolved static address (nonatomic) Then, patching fence to ensure all threads see address before self loop is removed Resolves field to 088aa5ach Example: Static field resolution (Intel x86): Lsnippet: push 024h ; cpIndex push 08564ach ; const pool call unresolvedStaticGlue db 0ff05h Example: Static field resolution (Intel x86) inc dword ptr[088aa5ah] Step 3: remove the self-loop and restore the original instruction bytes Benefits: thread-safe final code quality good BUT: uses self-loop Busy-wait loops can be BAD: Busy-wait loops can be BAD Employed for safety in code patching FIFO scheduling in Real-Time OS can result in live lock T1 (priority 10) T2 (priority 20) Resolve field ref Patch self-loop over instr Preempted by T2 T2 wakes up Tries to execute same field ref T2 stuck in self-loop T1 can never remove self-loop: live lock Busy-wait-free code patching: no live lock: Busy-wait-free code patching: no live lock Two basic approaches All threads do idempotent patch: let them all do it Cache line ping-pong effect may be slow(er) but correct Only one thread must patch: construct backup path Direct threads that arrive while patching to backup path Slower but correct execution Sometimes lowers resulting code quality Example: Static field resolution, no livelock: Example: Static field resolution, no livelock inc dword ptr[0h] Lsnippet: push 024h ; cp index push 08564ach ; const pool call unresolvedStaticGlue Example: Static field resolution, no livelock: Example: Static field resolution, no livelock inc dword ptr[0h] Lsnippet: push 024h ; cp index push 08564ach ; const pool call unresolvedStaticGlue call Lsnippet 5B call generated explicitly ahead of the instruction to be resolved Example: Static field resolution, no livelock: Example: Static field resolution, no livelock inc dword ptr[088aa5ach] Lsnippet: push 024h ; cp index push 08564ach ; const pool call unresolvedStaticGlue call Lsnippet After resolving static address, glue patches the memory ref instruction Resolves field to 088aa5ach Note: any threads that reach the glue ALL patch the memory ref instruction BUT all threads will patch same value, so no races Example: Static field resolution, no livelock: Example: Static field resolution, no livelock inc dword ptr[088aa5ach] Lsnippet: push 024h ; cp index push 08564ach ; const pool call unresolvedStaticGlue call Lsnippet Now need to get rid of call to snippet, since ref has been resolved Patch a 5-byte NOP over the call: lea eax, ds:[eax] BUT: can’t do it atomically in one shot, need 3 steps again NOTE that any thread can now safely execute the memory reference instruction because it’s been patched Resolves field to 088aa5ach Example: Static field resolution, no livelock: Example: Static field resolution, no livelock inc dword ptr[088aa5ach] Lsnippet: push 024h ; cp index push 08564ach ; const pool call unresolvedStaticGlue jmp +3 db 000002h Step 1: patch short jump JMP +3 to memory ref instruction (lock cmpxchg) Resolves field to 088aa5ach Example: Static field resolution, no livelock: Example: Static field resolution, no livelock inc dword ptr[088aa5ach] Lsnippet: push 024h ; cp index push 08564ach ; const pool call unresolvedStaticGlue jmp +3 db 002044h Step 2: patch last three bytes of 5-byte NOP instruction Resolves field to 088aa5ach Example: Static field resolution, no livelock: Example: Static field resolution, no livelock inc dword ptr[088aa5ach] Lsnippet: push 024h ; cpIndex push 08564ach ; const pool call unresolvedStaticGlue lea eax, ds:[eax] Step 3: patch first 2 bytes of 5 byte NOP over the JMP +3 Benefits: thread-safe no live lock because no busy-waits BUT: 5-byte NOP residue hot code size increase Resolves field to 088aa5ach Example 2: Virtual invocation cache: Example 2: Virtual invocation cache Virtual invocation o.foo() Target method depends on class of receiver object o Full virtual dispatch uses lookup in o’s class virtual function table Expensive: indirection from object’s class For performance, use virtual invocation cache if (receiver class is C) call C.foo(); else call o.foo(); Example 2: Virtual invocation cache: Example 2: Virtual invocation cache cmp ebx, andlt;CLASS Candgt; jne FullDispatchSnip call andlt;C.foo() entryandgt; Continue: FullDispatchSnip: mov ecx, [ebx-andlt;VFT slotandgt;] call ecx jmp Continue 0xCCCCCCCC, 0x0f85, and 0xFDFDFDFD must be patched atomically to initialize cache Example 2: Virtual invocation cache: Example 2: Virtual invocation cache cmp ebx, andlt;CLASS Candgt; jne FullDispatchSnip call andlt;c.foo() entryandgt; Continue: FullDispatchSnip: mov ecx, [ebx-andlt;VFT slotandgt;] call ecx jmp Continue If target Ti not compiled when cache initialized, then patch new target Tc over Ti after C.foo() is compiled (actually next time called) Example 2: Virtual invocation cache: Example 2: Virtual invocation cache cmp ebx, andlt;CLASS Candgt; jne FullDispatchSnip call andlt;c.foo() entryandgt; Continue: FullDispatchSnip: mov ecx, [ebx-andlt;VFT slotandgt;] call ecx jmp Continue This cache cannot be placed so that none of these fields cross 8B patching boundary Example 2: Virtual invocation cache: Example 2: Virtual invocation cache If target not compiled yet, target written into cache is address of glue function Glue function looks at target: compiled yet? If not compiled, transition to interpreter If compiled, patch compiled target into cache Problem: can’t write entire target atomically Can atomically write first 2 bytes of call instruction Fancy footwork to avoid writing full target atomically Example 2: Virtual invocation cache: Example 2: Virtual invocation cache cmp ebx, andlt;CLASS Candgt; jne FullDispatchSnip call andlt;c.foo() entryandgt; Continue: FullDispatchSnip: mov ecx, [ebx-andlt;VFT slotandgt;] call ecx jmp Continue Patching boundary can fall before 0f or after 85: same as if call didn’t need patching Example 2: Virtual invocation cache: Example 2: Virtual invocation cache cmp ebx, andlt;CLASS Candgt; jne FullDispatchSnip call andlt;c.foo() entryandgt; Continue: FullDispatchSnip: mov ecx, [ebx-andlt;VFT slotandgt;] call ecx jmp Continue Patching has several steps so cannot allow multiple threads to proceed: establish backup path to full dispatch Example 2: Virtual invocation cache: Example 2: Virtual invocation cache cmp ebx, 0ffffffffh jne FullDispatchSnip call andlt;c.foo() entryandgt; Continue: FullDispatchSnip: mov ecx, [ebx-andlt;VFT slotandgt;] call ecx jmp Continue First, clear out class pointer: effectively converts ‘jne’ into ‘jmp’ (atomic compare and exchange) Patching Fence Example 2: Virtual invocation cache: Example 2: Virtual invocation cache cmp ebx, 0ffffffffh jne FullDispatchSnip jmp -14 db TgTgTg Continue: FullDispatchSnip: mov ecx, [ebx-andlt;VFT slotandgt;] call ecx jmp Continue Next, protect last 3 bytes of call instruction with JMP -14 (back to compare instruction) Patching Fence Example 2: Virtual invocation cache: Example 2: Virtual invocation cache cmp ebx, 0ffffffffh jne FullDispatchSnip jmp -14 db TcTcTc Continue: FullDispatchSnip: mov ecx, [ebx-andlt;VFT slotandgt;] call ecx jmp Continue Now we can patch the last three bytes of the call with the new target Tc Patching Fence Example 2: Virtual invocation cache: Example 2: Virtual invocation cache cmp ebx, 0ffffffffh jne FullDispatchSnip call TcTcTcTc Continue: FullDispatchSnip: mov ecx, [ebx-andlt;VFT slotandgt;] call ecx jmp Continue Remove the JMP -14 by putting the call instruction back Example 2: Virtual invocation cache: Example 2: Virtual invocation cache cmp ebx, andlt;CLASS Candgt; jne FullDispatchSnip call TcTcTcTc Continue: FullDispatchSnip: mov ecx, [ebx-andlt;VFT slotandgt;] call ecx jmp Continue Finally, put the true class pointer back into the compare instruction Summary: Summary Modern JITs generate code that can patch itself via runtime helpers Helpers are complex,hand-written assembler Interactions with class loading, stack walking Busy-wait loops employed to prevent thread races Real-Time operating systems use FIFO scheduling Busy-wait loops can result in live lock Summary: Summary Avoid live lock with two techniques: If same value to be written, let all threads write it If only one thread can write, establish backup path first for all threads but one to use Two examples Unresolved static field reference Updating virtual invocation cache target when it has been (re)compiled Questions?: Questions? Mark Stoodley IBM Toronto Lab mstoodle@ca.ibm.com