OLS-2004-Slides

Views:
 
Category: Education
     
 

Presentation Description

No description available.

Comments

Presentation Transcript

I would hate user space locking if it weren't that sexy...fusyn+RTNPTL: the making of a real-time synchronization infrastructure for Linuxhttp://developer.osdl.org/dev/robustmutexesInaky Perez-Gonzalez <inaky.perez-gonzalez@intel.com> Boris Hu <boris.hu@intel.com>Salwan Searty <salwan.searty@intel.com>Adam Li <adam.li@intel.com>David P. Howell <david.p.howell@intel.com>Linux OS & Technology TeamIntel Corporation : 

I would hate user space locking if it weren't that sexy...fusyn+RTNPTL: the making of a real-time synchronization infrastructure for Linuxhttp://developer.osdl.org/dev/robustmutexesInaky Perez-Gonzalez <inaky.perez-gonzalez@intel.com> Boris Hu <boris.hu@intel.com>Salwan Searty <salwan.searty@intel.com>Adam Li <adam.li@intel.com>David P. Howell <david.p.howell@intel.com>Linux OS & Technology TeamIntel Corporation

So this is... : 

So this is... I would hate user space locking... page 2 Intel Corporation—Linux OS & Technology Team—7/20/04 fusyn+RTNPTL steams off the OSDL CGL working group as a way to fill some of the POSIX real-time support gaps that Telecom manufacturers need to implement and port fault-proof systems: mutexes and condvars with futexes are FIFO when a mutex owner dies, waiters wait forever no priority inversion protection Provides a synchronization infrastructure (mutexes, conditional variables) that is: POSIX soft Real-Time compatible Robust Safe from / mitigates priority inversion. Available in the kernel and in user space

The basic kernel stuff : 

The basic kernel stuff I would hate user space locking... page 3 Intel Corporation—Linux OS & Technology Team—7/20/04 It all starts with a simple queue: queued waiters are sorted by priority. queue operations (addition, removal, location of highest priority node) are O(1) to insure scalability and minimize latency. if we change the priority of a waiter, it repositions in the queue. This cannot be done with futexes without scalability issues. To allow extensibility some functionality implemented in a fuqueue operations pointer. (*) Who wants to pop the joke about the name? spinlock queue ops

The fuqueue interface : 

The fuqueue interface I would hate user space locking... page 4 Intel Corporation—Linux OS & Technology Team—7/20/04 #include <linux/fuqueue.h> struct fuqueue fq = fuqueue_INIT (&fq); fuqueue_wait (&fq, &timeout); fuqueue_wake (&fq, 9, -EAGAIN); _wait() will block on fq—it will return the code passed to _wake(), -EINTR if interrupted by signals, -ETIMEDOUT if too much time went by. fuqueue_wait_fcb (&fq, &timeout, flags, cb); fuqueue_wake_state (&fq, 9, -EAGAIN, TASK_STOPPED, 1 /*sync*/); Pathetic attemp to mimic some of the waitqueue interfaces.

fuqueue: changing priority : 

fuqueue: changing priority I would hate user space locking... page 5 Intel Corporation—Linux OS & Technology Team—7/20/04 Hook in priority/policy manipulation interfaces: struct task_struct { ... struct fuqueue *fuqueue_wait; struct fuqueue_waiter *fuqueue_waiter; spinlock_t fuqueue_wait_lock; ... }; fuqueue_waiter_chprio (task, old_prio); Follow the link back to the fuqueue the task is waiting for, reposition on the list with queue chprio operation; optionally recurse depending on queue op (for priority inheritance).

fuqueue: cancelling wait : 

fuqueue: cancelling wait I would hate user space locking... page 6 Intel Corporation—Linux OS & Technology Team—7/20/04 Same: hook up fuqueue_waiter_cancel() in a bunch of places [process_timeout(), signalling, __wake_up_common()]—should be happening in try_to_wake_up(). Safely acquires locks and determines which fuqueue the task is waiting for, calls fuqueue-specific waiter_cancel() op from fuqueue->ops and cleans task_struct back-pointers. The spinlock protecting the fuqueue is an IRQ spinlock—cancels, as well as wakes can be called from IRQ contexts (eg: timeouts). It is necessary correctly implementing priority inheritance that the waker unqueues the wakee [and avoids re-taking a spinlock].

fuqueue: ops : 

fuqueue: ops I would hate user space locking... page 7 Intel Corporation—Linux OS & Technology Team—7/20/04 struct fuqueue_ops { void (* get) (struct fuqueue *); void (* put) (struct fuqueue *); unsigned (* waiter_cancel) (struct fuqueue *, struct fuqueue_waiter *); struct task_struct * (* waiter_chprio) ( struct task_struct *, struct fuqueue *, struct fuqueue_waiter *); }; Provides flexibility—the fuqueue layer provides basic operations [fuqueue_op_*()] that other layers (ufuqueues, fulocks, ufulocks) use. As well, most interface functions are broken up for easy reuse.

fulocks: chubby kernel mutexes : 

fulocks: chubby kernel mutexes I would hate user space locking... page 8 Intel Corporation—Linux OS & Technology Team—7/20/04 Take a fuqueue, add the concept of an owner, a node for a per-task ownership list, some flags and you have a fulock. fuqueue node flags owner fuqueue node flags owner fuqueue node flags owner

fulocks: locking : 

fulocks: locking I would hate user space locking... page 9 Intel Corporation—Linux OS & Technology Team—7/20/04 #include <linux/fulock.h> struct fulock fl = fulock_INIT (&fl); fl.flags = FULOCK_FL_PI | FULOCK_FL_RM; ... int result = fulock_lock (&fl, &timeout); After checking validity, take it if unlocked, set ownership to current and exit. Otherwise, use the fuqueue internal interface to queue a fuqueue_waiter in the fulock's fuqueue and wait on it. Returns 0 if you get the lock, -EOWNERDEAD if you got it, but some previous owner died, or other errors (-ETIMEDOUT on timeout, -EINTR on signal, etc...).

fulocks: unlocking : 

fulocks: unlocking I would hate user space locking... page 10 Intel Corporation—Linux OS & Technology Team—7/20/04 #include <linux/fulock.h> ... result = fulock_unlock (&fl, FULOCK_UNLOCK_AUTO); If there is an owner, reset ownership, get the first waiter from the fulock' s fuqueue (if any), decide how to unlock (type), unqueue (and maybe transfer ownership) and wake it up. Unlock type: serial: first waiter is made the new owner, ownership transfer parallel: first waiter is woken up and recontends for the lock auto: serial if the first waiter is a real-time task, parallel if not. Lock primitives automatically recontend for the lock in the parallel case.

fulocks: operations : 

fulocks: operations I would hate user space locking... page 11 Intel Corporation—Linux OS & Technology Team—7/20/04 struct fulock_ops { struct fuqueue_ops fuqueue; unsigned (* owner_set) (&fulock, &task); unsigned (* owner_reset) (&fulock); enum fulock_unlock_type (* unlock_type) ( &code, &fulock, &fuqueue_waiter, unlock_type, priv); void (* unqueue) (fulock, fuqueue_waiter, unlock_type, priv); void (* exit) (fulock); }; Provide flexibility to build on top. fuqueue_ops are reused.

fulocks: tracking ownership : 

fulocks: tracking ownership I would hate user space locking... page 12 Intel Corporation—Linux OS & Technology Team—7/20/04 struct task_struct { ... struct plist fulock_olist; spinlock_t fulock_olist_lock; }; __fulock_op_owner_set (struct fulock *, struct task_struct *); __fulock_op_owner_reset (struct fulock *); Add fulock to task->fulock_olist, set fulock->owner...or reset fulock->owner and remove from the list. The per-task ownership list is sorted by fulock priority.

fulocks: robustness : 

fulocks: robustness I would hate user space locking... page 13 Intel Corporation—Linux OS & Technology Team—7/20/04 Dies miserably

fulocks: robustness [how] : 

fulocks: robustness [how] I would hate user space locking... page 14 Intel Corporation—Linux OS & Technology Team—7/20/04 kernel/exit.c: do_exit() { ... exit_fulocks(); ... } For each owned fulock in current->fulock_olist, call ops->exit(). The basic fulock_op_exit() adds the FULOCK_FL_DEAD flag and unlocks serially. Waiter (or first future contender) will acquire it with -EOWNERDEAD. The key here is that the kernel knows at all times who is owning which fulocks.

fulocks: priority inheritance : 

fulocks: priority inheritance I would hate user space locking... page 15 Intel Corporation—Linux OS & Technology Team—7/20/04

fulocks: priority protection : 

fulocks: priority protection I would hate user space locking... page 16 Intel Corporation—Linux OS & Technology Team—7/20/04

PI & PP: the generic guts : 

PI & PP: the generic guts I would hate user space locking... page 17 Intel Corporation—Linux OS & Technology Team—7/20/04 Generic priority boosting facility in the scheduler: struct task_struct { ... int boost_prio; ... }; ... unsigned __prio_boost(task_t *p, int prio); Whenever we update the dynamic prio, do: task->prio = min(new_dyn_prio, task->boost_prio); This is a single hook into the effective_prio() hotpath, two other hooks in __setscheduler() and set_user_nice().

PI & PP: the glue—two rules : 

PI & PP: the glue—two rules I would hate user space locking... page 18 Intel Corporation—Linux OS & Technology Team—7/20/04 Priority inheritance: everytime we manipulate any list, propagate if the list top priority change—boost every task we hit while propagating to it's fulock_olist's priority. [__fulock_prio_update(), fuqueue_waiter_chprio() and __fulock_op_waiter_chprio()]. Priority protection: every time a task acquires a fulock, the olist_node of that fulock has the priority ceiling of the fulock. Goto 1). That's it. For this to happen without deadlocks, the unlocker has to unqueue the new locker and let it loose.

PI & PP: simple case : 

PI & PP: simple case I would hate user space locking... page 19 Intel Corporation—Linux OS & Technology Team—7/20/04

PI & PP: ownership-wait chain : 

PI & PP: ownership-wait chain I would hate user space locking... page 20 Intel Corporation—Linux OS & Technology Team—7/20/04 Wait here! Priority propagation is worst case O(n) on the depth of the chain—any other operation is O(1).

vlocator, ufu & vfuhabemus userspace : 

vlocator, ufu & vfuhabemus userspace I would hate user space locking... page 21 Intel Corporation—Linux OS & Technology Team—7/20/04 Need to provide a way for userspace to use this infrastructure. Use the futex method, generic way: vlocator: associate uaddr with kernel object through hash table, manage its life cycle, dispose it when its use count drops to zero. vl_key_create() vl_find() in hash table vl_find_or_create(): if not found, call vl_ops->alloc() + vl_ops->create(), add to hash, return it. vl_dispose(): detach from hash table, destroy when refcount == 0

{v,u}fuqueue: the futex wannabe : 

{v,u}fuqueue: the futex wannabe I would hate user space locking... page 22 Intel Corporation—Linux OS & Technology Team—7/20/04 Associates a vfuqueue (representation in user space) to a ufuqueue that is really a fuqueue. int sys_ufuqueue_wait (&vfuqueue, val, &timeout); int sys_ufuqueue_wake (&vfuqueue, howmany, code); int sys_ufuqueue_ctl (&vfuqueue, FUQUEUE_CTL_*); Imitates the futex behaviour, except for: extra error codes wake up is scheduling policy based responds to priority changes TODO: implement requeue

{v,u}fulock: the mutex : 

{v,u}fulock: the mutex I would hate user space locking... page 23 Intel Corporation—Linux OS & Technology Team—7/20/04 int sys_ufulock_lock (&vfulock, flags, &timeout); int sys_ufulock_unlock (&vfulock, flags, type); int sys_ufulock_requeue (&vfuqueue, val, &vfulock, flags); int sys_ufulock_ctl (&vfulock, flags,FULOCK_CTL*); Two modes of operation: KCO [Kernel Controlled Ownership]: slow, all arches Fast-path: waay faster, needs atomic cmpxchg

Fast-path: why bother the kernel? : 

Fast-path: why bother the kernel? I would hate user space locking... page 24 Intel Corporation—Linux OS & Technology Team—7/20/04 As in with futexes, there should be no need to go to the kernel if there is no contention: int vfulock_lock (&vfulock, flags, pid, &timeout) { unsigned old = VFULOCK_UNLOCKED; if (cmpxchg(vfulock,old,pid) != old) return 0; return SYSCALL(ufulock_lock,3,vfulock,flags,to); } That's your userspace lock operation. The vfulock state: vfulock == 0 UNLOCKED vfulock < VFULOCK_WP Fast locked [with PID] vfulock == VFULOCK_{WP,DEAD} Kernel rules [like KCO] vfulock == VFULOCK_NR Not recoverable; abort...

Slow-path: syncing up : 

Slow-path: syncing up I would hate user space locking... page 25 Intel Corporation—Linux OS & Technology Team—7/20/04 If fast-path enabled, need to find_or_create() a ufulock, sync it up and pass the operation down to the fulock layer. “State machine”: vfulock_sync_nonkco()—finds out what the kernel knows about the ufulock, what user space thinks about it and makes sure they are in sync. Fugly, but it works :) Yes Found task Not found task died dead-owner! No, ufulock is synced, owner known by kernel, waiters present Unlocked

syncing miscelanea : 

syncing miscelanea I would hate user space locking... page 26 Intel Corporation—Linux OS & Technology Team—7/20/04 Any sys_ufulock_*() function needs to sync first thing. sync state machine can detect and fix inconsistencies (when the kernel knows about the ufulock, it is easy, when not, it is a best guess). Helps preventing lock stealing that causes priority inversion. PIDs are reused—you can miss some process deaths and assign ownerships wrong. Workarounds/solutions for that and false positives: Increase your PID space / tasks ratio [default is 32k pids] PIDX: hash PID + task->start_time; kind of a lame task UUID—use that for the vfulock and task lookup. Look around task->mm, check if current is sharing the vfulock with that task.

Back to fast-path: unlocking : 

Back to fast-path: unlocking I would hate user space locking... page 27 Intel Corporation—Linux OS & Technology Team—7/20/04 Unlocking might be more complex: int vfulock_unlock (&vfulock, flags, unlock_type) { unsigned old_val = *vfulock; while (old_val < VFULOCK_WP) { if (cmpxchg(vfulock,old_val,VFULOCK_UNLOCKED) == old_val) return 0; old_val = *vfulock; } return SYSCALL(ufulock_unlock,3,vfulock, flags,unlock_type); } Same thing: try the fast path—if it fails, let the kernel handle it.

RTNPTL: the POSIX glue : 

RTNPTL: the POSIX glue I would hate user space locking... page 28 Intel Corporation—Linux OS & Technology Team—7/20/04 Modifications to NPTL to have mutexes and condvars take advantage of fusyn. Implements POSIX tags TPI, TPP (soon), robustness (non-POSIX extension). Improves compliance in other areas (unlock order by scheduling policy, for example). Injected at the lll_*() layer of glibc/NPTL. Defaults to the fastest combination of unlock mode and fulock type (parallel, fast-path enabled). Non-POSIX extensions allow to tweak fusyn specific parameters to suit user tastes.

Example: Priority Inheritance : 

Example: Priority Inheritance I would hate user space locking... page 29 Intel Corporation—Linux OS & Technology Team—7/20/04

Performance: Volanomark : 

Performance: Volanomark I would hate user space locking... page 30 Intel Corporation—Linux OS & Technology Team—7/20/04 Dual P3 933 MHz 256 KB cache, 512MB RAM Fedora2, gcc 3.3.3 20040412 (RH Linux 3.3.3-7) glibc: 2004-07-20 + RTNPTL 2.3 kernel: 2.6.5 + fusyn 2.3 JVM: IBMJava2-141 (-Xms64m -Xmx256m)

Performance: general : 

Performance: general I would hate user space locking... page 31 Intel Corporation—Linux OS & Technology Team—7/20/04 Considering the extra overhead [and that we take one too many spinlocks with IRQs down], we are doing fine... On a system identical to Volanomark's [with slightly older snaphosts], we saw: serialized unlock()→lock() slow-path latency, 60±10μs (idle system), 110±10μs (heavy network traffic) and 130±10μs (heavy network traffic + IDE traffic). on some informal tests, jitter increases over NPTL less or equal to 1μs; however, this was measured with the HRT patch, that gives a 10μs resolution. In fact, non-conclusive, or very small [if you want to believe]. General operation is about 10% slower than NPTL, depending on the stress case. We need to solve that.

Backup : 

Backup I would hate user space locking... page 32 Intel Corporation—Linux OS & Technology Team—7/20/04

KCO: the slow mode : 

KCO: the slow mode I would hate user space locking... page 33 Intel Corporation—Linux OS & Technology Team—7/20/04 ufulocks use the vfulock to maintain part of the health state [normal, dead-owner, not-recoverable] of the fulock (because it might be wiped in the kernel when refcount drops to zero). Step 1: synchronize what the kernel knows of the fulock with what user space says it should [vfulock_sync_kco()]. Step 2: operate on the fulock normally (calling the fulock layer)maintaining consistency with the vfulock value. All transactions go through the kernel. PP fulocks are always KCO [they manipulate task prio on every transition—only the kernel can do that].

Plans : 

Plans I would hate user space locking... page 34 Intel Corporation—Linux OS & Technology Team—7/20/04 Cleanups & optimizations [especially RTNPTL] If it is wanted and my manager is convinced, rw fulocks with robustness, PI and PP. Unlikely, this is completely not POSIX and presents many more challenges. Eliminate the hash-table lookup in the vlocator to avoid the last O(n) path...try different ideas. ... fix bugs ...