summaryrefslogtreecommitdiffstats
path: root/kernel/futex.c
Commit message (Collapse)AuthorAgeFilesLines
...
| * rtmutex: Simplify PI algorithm and make highest prio task get lockLai Jiangshan2011-01-271-10/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In current rtmutex, the pending owner may be boosted by the tasks in the rtmutex's waitlist when the pending owner is deboosted or a task in the waitlist is boosted. This boosting is unrelated, because the pending owner does not really take the rtmutex. It is not reasonable. Example. time1: A(high prio) onwers the rtmutex. B(mid prio) and C (low prio) in the waitlist. time2 A release the lock, B becomes the pending owner A(or other high prio task) continues to run. B's prio is lower than A, so B is just queued at the runqueue. time3 A or other high prio task sleeps, but we have passed some time The B and C's prio are changed in the period (time2 ~ time3) due to boosting or deboosting. Now C has the priority higher than B. ***Is it reasonable that C has to boost B and help B to get the rtmutex? NO!! I think, it is unrelated/unneed boosting before B really owns the rtmutex. We should give C a chance to beat B and win the rtmutex. This is the motivation of this patch. This patch *ensures* only the top waiter or higher priority task can take the lock. How? 1) we don't dequeue the top waiter when unlock, if the top waiter is changed, the old top waiter will fail and go to sleep again. 2) when requiring lock, it will get the lock when the lock is not taken and: there is no waiter OR higher priority than waiters OR it is top waiter. 3) In any time, the top waiter is changed, the top waiter will be woken up. The algorithm is much simpler than before, no pending owner, no boosting for pending owner. Other advantage of this patch: 1) The states of a rtmutex are reduced a half, easier to read the code. 2) the codes become shorter. 3) top waiter is not dequeued until it really take the lock: they will retain FIFO when it is stolen. Not advantage nor disadvantage 1) Even we may wakeup multiple waiters(any time when top waiter changed), we hardly cause "thundering herd", the number of wokenup task is likely 1 or very little. 2) two APIs are changed. rt_mutex_owner() will not return pending owner, it will return NULL when the top waiter is going to take the lock. rt_mutex_next_owner() always return the top waiter. will not return NULL if we have waiters because the top waiter is not dequeued. I have fixed the code that use these APIs. need updated after this patch is accepted 1) Document/* 2) the testcase scripts/rt-tester/t4-l2-pi-deboost.tst Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> LKML-Reference: <4D3012D5.4060709@cn.fujitsu.com> Reviewed-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* | futex: Deobfuscate handle_futex_death()Thomas Gleixner2011-03-141-3/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | handle_futex_death() uses futex_atomic_cmpxchg_inatomic() without disabling page faults. That's ok, but totally non obvious. We don't hold locks so we actually can and want to fault here, because the get_user() before futex_atomic_cmpxchg_inatomic() does not guarantee a R/W mapping. We could just add a big fat comment to explain this, but actually changing the code so that the functionality is entirely clear is better. Use the helper function which disables page faults around the futex_atomic_cmpxchg_inatomic() and handle a fault with a call to fault_in_user_writeable() as all other places in the futex code do as well. Pointed-out-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Darren Hart <darren@dvhart.com> Cc: Michel Lespinasse <walken@google.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Matt Turner <mattst88@gmail.com> Cc: Russell King <linux@arm.linux.org.uk> Cc: David Howells <dhowells@redhat.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Michal Simek <monstr@monstr.eu> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: "James E.J. Bottomley" <jejb@parisc-linux.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Paul Mundt <lethal@linux-sh.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Chris Metcalf <cmetcalf@tilera.com> LKML-Reference: <alpine.LFD.2.00.1103141126590.2787@localhost6.localdomain6> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* | Merge branch 'tip/futex/devel' of ↵Thomas Gleixner2011-03-121-17/+23
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-rt into core/futexes futex,plist: Pass the real head of the priority list to plist_del() futex,plist: Remove debug lock assignment from plist_node plist: Shrink struct plist_head plist: Add priority list test
| * | futex,plist: Remove debug lock assignment from plist_nodeLai Jiangshan2011-03-111-9/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The original code uses &plist_node->plist as the fake head of the priority list for plist_del(), these debug locks in the fake head are needed for CONFIG_DEBUG_PI_LIST. But now we always pass the real head to plist_del(), the debug locks in plist_node will not be used, so we remove these assignments. Acked-by: Darren Hart <dvhart@linux.intel.com> Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> LKML-Reference: <4D10797E.7040803@cn.fujitsu.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
| * | futex,plist: Pass the real head of the priority list to plist_del()Lai Jiangshan2011-03-111-8/+23
| |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Some plist_del()s in kernel/futex.c are passed a faked head of the priority list. It does not fail because the current code does not require the real head in plist_del(). The current code of plist_del() just uses the head for checking, so it will not cause a bad result even when we use a faked head. But it is undocumented usage: /** * plist_del - Remove a @node from plist. * * @node: &struct plist_node pointer - entry to be removed * @head: &struct plist_head pointer - list head */ The document says that the @head is the "list head" head of the priority list. In futex code, several places use "plist_del(&q->list, &q->list.plist);", they pass a fake head. We need to fix them all. Thanks to Darren Hart for many suggestions. Acked-by: Darren Hart <dvhart@linux.intel.com> Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> LKML-Reference: <4D11984A.5030203@cn.fujitsu.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
* | futex: Sanitize cmpxchg_futex_value_locked APIMichel Lespinasse2011-03-111-30/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The cmpxchg_futex_value_locked API was funny in that it returned either the original, user-exposed futex value OR an error code such as -EFAULT. This was confusing at best, and could be a source of livelocks in places that retry the cmpxchg_futex_value_locked after trying to fix the issue by running fault_in_user_writeable(). This change makes the cmpxchg_futex_value_locked API more similar to the get_futex_value_locked one, returning an error code and updating the original value through a reference argument. Signed-off-by: Michel Lespinasse <walken@google.com> Acked-by: Chris Metcalf <cmetcalf@tilera.com> [tile] Acked-by: Tony Luck <tony.luck@intel.com> [ia64] Acked-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Michal Simek <monstr@monstr.eu> [microblaze] Acked-by: David Howells <dhowells@redhat.com> [frv] Cc: Darren Hart <darren@dvhart.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Matt Turner <mattst88@gmail.com> Cc: Russell King <linux@arm.linux.org.uk> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: "James E.J. Bottomley" <jejb@parisc-linux.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Paul Mundt <lethal@linux-sh.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> LKML-Reference: <20110311024851.GC26122@google.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* | futex: Avoid redudant evaluation of task_pid_vnr()Thomas Gleixner2011-03-111-8/+8
| | | | | | | | | | | | | | | | The result is not going to change under us, so no need to reevaluate this over and over. Seems to be a leftover from the mechanical mass conversion of task->pid to task_pid_vnr(tsk). Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* | futex: Update futex_wait_setup comments about lockingMichel Lespinasse2011-03-101-5/+6
|/ | | | | | | | | | | | | | Reviving a cleanup I had done about a year ago as part of a larger futex_set_wait proposal. Over the years, the locking of the hashed futex queue got improved, so that some of the "rare but normal" race conditions described in comments can't actually happen anymore. Signed-off-by: Michel Lespinasse <walken@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Darren Hart <dvhltc@us.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> LKML-Reference: <20110307020750.GA31188@google.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
*-. Merge branches 'core-fixes-for-linus', 'x86-fixes-for-linus', ↵Linus Torvalds2011-01-151-4/+3
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 'timers-fixes-for-linus' and 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: rcu: avoid pointless blocked-task warnings rcu: demote SRCU_SYNCHRONIZE_DELAY from kernel-parameter status rtmutex: Fix comment about why new_owner can be NULL in wake_futex_pi() * 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: x86, olpc: Add missing Kconfig dependencies x86, mrst: Set correct APB timer IRQ affinity for secondary cpu x86: tsc: Fix calibration refinement conditionals to avoid divide by zero x86, ia64, acpi: Clean up x86-ism in drivers/acpi/numa.c * 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: timekeeping: Make local variables static time: Rename misnamed minsec argument of clocks_calc_mult_shift() * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: tracing: Remove syscall_exit_fields tracing: Only process module tracepoints once perf record: Add "nodelay" mode, disabled by default perf sched: Fix list of events, dropping unsupported ':r' modifier Revert "perf tools: Emit clearer message for sys_perf_event_open ENOENT return" perf top: Fix annotate segv perf evsel: Fix order of event list deletion
| * | rtmutex: Fix comment about why new_owner can be NULL in wake_futex_pi()Steven Rostedt2011-01-111-4/+3
| |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | The comment about why rt_mutex_next_owner() can return NULL in wake_futex_pi() is not the normal case. Tracing the cause of why this occurs is more likely that waiter simply timedout. But because it originally caused contention with the futex, the owner will go into the kernel when it unlocks the lock. Then it will hit this code path and rt_mutex_next_owner() will return NULL. Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* / thp: update futex compound knowledgeAndrea Arcangeli2011-01-131-10/+45
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Futex code is smarter than most other gup_fast O_DIRECT code and knows about the compound internals. However now doing a put_page(head_page) will not release the pin on the tail page taken by gup-fast, leading to all sort of refcounting bugchecks. Getting a stable head_page is a little tricky. page_head = page is there because if this is not a tail page it's also the page_head. Only in case this is a tail page, compound_head is called, otherwise it's guaranteed unnecessary. And if it's a tail page compound_head has to run atomically inside irq disabled section __get_user_pages_fast before returning. Otherwise ->first_page won't be a stable pointer. Disableing irq before __get_user_page_fast and releasing irq after running compound_head is needed because if __get_user_page_fast returns == 1, it means the huge pmd is established and cannot go away from under us. pmdp_splitting_flush_notify in __split_huge_page_splitting will have to wait for local_irq_enable before the IPI delivery can return. This means __split_huge_page_refcount can't be running from under us, and in turn when we run compound_head(page) we're not reading a dangling pointer from tailpage->first_page. Then after we get to stable head page, we are always safe to call compound_lock and after taking the compound lock on head page we can finally re-check if the page returned by gup-fast is still a tail page. in which case we're set and we didn't need to split the hugepage in order to take a futex on it. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* futex: Add futex_q static initializerDarren Hart2010-11-101-15/+10
| | | | | | | | | | | | | | | | | | | | | | The futex_q struct has grown considerably over the last couple years. I believe it now merits a static initializer to avoid uninitialized data errors (having spent more time than I care to admit debugging an uninitialized q.bitset in an experimental new op code). With the key initializer built in, several of the FUTEX_KEY_INIT calls can be removed. V2: use a static variable instead of an init macro. use a C99 initializer and don't rely on variable ordering in the struct. V3: make futex_q_init const Signed-off-by: Darren Hart <dvhart@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: John Kacur <jkacur@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> LKML-Reference: <1289252428-18383-1-git-send-email-dvhart@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* futex: Replace fshared and clockrt with combined flagsDarren Hart2010-11-101-76/+70
| | | | | | | | | | | | | | | | In the early days we passed the mmap sem around. That became the "int fshared" with the fast gup improvements. Then we added "int clockrt" in places. This patch unifies these options as "flags". [ tglx: Split out the stale fshared cleanup ] Signed-off-by: Darren Hart <dvhart@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: John Kacur <jkacur@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> LKML-Reference: <1289250609-16304-1-git-send-email-dvhart@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* futex: Cleanup stale fshared flag interfacesThomas Gleixner2010-11-101-34/+30
| | | | | | | | | | | | | | The fast GUP changes stopped using the fshared flag in put_futex_keys(), but we kept the interface the same. Cleanup all stale users. This patch is split out from Darren Harts combo patch which also combines various flags. This way the changes are clearly separated. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Darren Hart <dvhart@linux.intel.com> LKML-Reference: <1289250609-16304-1-git-send-email-dvhart@linux.intel.com>
* futex: Address compiler warnings in exit_robust_listDarren Hart2010-11-101-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since commit 1dcc41bb (futex: Change 3rd arg of fetch_robust_entry() to unsigned int*) some gcc versions decided to emit the following warning: kernel/futex.c: In function ‘exit_robust_list’: kernel/futex.c:2492: warning: ‘next_pi’ may be used uninitialized in this function The commit did not introduce the warning as gcc should have warned before that commit as well. It's just gcc being silly. The code path really can't result in next_pi being unitialized (or should not), but let's keep the build clean. Annotate next_pi as an uninitialized_var. [ tglx: Addressed the same issue in futex_compat.c and massaged the changelog ] Signed-off-by: Darren Hart <dvhart@linux.intel.com> Tested-by: Matt Fleming <matt@console-pimps.org> Tested-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: John Kacur <jkacur@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> LKML-Reference: <1288897200-13008-1-git-send-email-dvhart@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* new helper: ihold()Al Viro2010-10-251-1/+1
| | | | | | Clones an existing reference to inode; caller must already hold one. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* Merge branch 'futexes-for-linus' of ↵Linus Torvalds2010-10-211-14/+21
|\ | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'futexes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: futex: Fix kernel-doc notation & typos futex: Add lock context annotations futex: Mark restart_block.futex.uaddr[2] __user futex: Change 3rd arg of fetch_robust_entry() to unsigned int*
| * futex: Fix kernel-doc notation & typosRandy Dunlap2010-10-141-11/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Convert futex_requeue() function parameters to use @name kernel-doc notation and add @fshared & @cmpval to prevent kernel-doc warnings. Add @list to struct futex_q. Fix a few typos. Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Acked-by: Rusty Russell <rusty@rustcorp.com.au> LKML-Reference: <20101013110234.89b06043.randy.dunlap@oracle.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * futex: Add lock context annotationsNamhyung Kim2010-09-181-0/+4
| | | | | | | | | | | | | | | | | | | | | | queue_lock/unlock/me() and unqueue_me_pi() grab/release spinlocks but are missing proper annotations. Add them. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Darren Hart <dvhltc@us.ibm.com> LKML-Reference: <1284468228-8723-3-git-send-email-namhyung@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
| * futex: Mark restart_block.futex.uaddr[2] __userNamhyung Kim2010-09-181-2/+2
| | | | | | | | | | | | | | | | | | | | | | @uaddr and @uaddr2 fields in restart_block.futex are user pointers. Add __user and remove unnecessary casts. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Darren Hart <dvhltc@us.ibm.com> LKML-Reference: <1284468228-8723-2-git-send-email-namhyung@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
| * futex: Change 3rd arg of fetch_robust_entry() to unsigned int*Namhyung Kim2010-09-181-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | Sparse complains: kernel/futex.c:2495:59: warning: incorrect type in argument 3 (different signedness) Make 3rd argument of fetch_robust_entry() 'unsigned int'. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Darren Hart <dvhltc@us.ibm.com> LKML-Reference: <1284468228-8723-1-git-send-email-namhyung@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* | futex: Fix errors in nested key ref-countingDarren Hart2010-10-191-15/+16
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | futex_wait() is leaking key references due to futex_wait_setup() acquiring an additional reference via the queue_lock() routine. The nested key ref-counting has been masking bugs and complicating code analysis. queue_lock() is only called with a previously ref-counted key, so remove the additional ref-counting from the queue_(un)lock() functions. Also futex_wait_requeue_pi() drops one key reference too many in unqueue_me_pi(). Remove the key reference handling from unqueue_me_pi(). This was paired with a queue_lock() in futex_lock_pi(), so the count remains unchanged. Document remaining nested key ref-counting sites. Signed-off-by: Darren Hart <dvhart@linux.intel.com> Reported-and-tested-by: Matthieu Fertré<matthieu.fertre@kerlabs.com> Reported-by: Louis Rilling<louis.rilling@kerlabs.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: John Kacur <jkacur@redhat.com> Cc: Rusty Russell <rusty@rustcorp.com.au> LKML-Reference: <4CBB17A8.70401@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@kernel.org
* futex: futex_find_get_task remove credentails checkMichal Hocko2010-06-301-13/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | futex_find_get_task is currently used (through lookup_pi_state) from two contexts, futex_requeue and futex_lock_pi_atomic. None of the paths looks it needs the credentials check, though. Different (e)uids shouldn't matter at all because the only thing that is important for shared futex is the accessibility of the shared memory. The credentail check results in glibc assert failure or process hang (if glibc is compiled without assert support) for shared robust pthread mutex with priority inheritance if a process tries to lock already held lock owned by a process with a different euid: pthread_mutex_lock.c:312: __pthread_mutex_lock_full: Assertion `(-(e)) != 3 || !robust' failed. The problem is that futex_lock_pi_atomic which is called when we try to lock already held lock checks the current holder (tid is stored in the futex value) to get the PI state. It uses lookup_pi_state which in turn gets task struct from futex_find_get_task. ESRCH is returned either when the task is not found or if credentials check fails. futex_lock_pi_atomic simply returns if it gets ESRCH. glibc code, however, doesn't expect that robust lock returns with ESRCH because it should get either success or owner died. Signed-off-by: Michal Hocko <mhocko@suse.cz> Acked-by: Darren Hart <dvhltc@us.ibm.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Nick Piggin <npiggin@suse.de> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* futex: Handle futex value corruption gracefullyThomas Gleixner2010-02-031-2/+19
| | | | | | | | | | | | | | | | | | | | | | The WARN_ON in lookup_pi_state which complains about a mismatch between pi_state->owner->pid and the pid which we retrieved from the user space futex is completely bogus. The code just emits the warning and then continues despite the fact that it detected an inconsistent state of the futex. A conveniant way for user space to spam the syslog. Replace the WARN_ON by a consistency check. If the values do not match return -EINVAL and let user space deal with the mess it created. This also fixes the missing task_pid_vnr() when we compare the pi_state->owner pid with the futex value. Reported-by: Jermome Marchand <jmarchan@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Darren Hart <dvhltc@us.ibm.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: <stable@kernel.org>
* futex: Handle user space corruption gracefullyThomas Gleixner2010-02-031-0/+7
| | | | | | | | | | | | | | | | | | | | | | | If the owner of a PI futex dies we fix up the pi_state and set pi_state->owner to NULL. When a malicious or just sloppy programmed user space application sets the futex value to 0 e.g. by calling pthread_mutex_init(), then the futex can be acquired again. A new waiter manages to enqueue itself on the pi_state w/o damage, but on unlock the kernel dereferences pi_state->owner and oopses. Prevent this by checking pi_state->owner in the unlock path. If pi_state->owner is not current we know that user space manipulated the futex value. Ignore the mess and return -EINVAL. This catches the above case and also the case where a task hijacks the futex by setting the tid value and then tries to unlock it. Reported-by: Jermome Marchand <jmarchan@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Darren Hart <dvhltc@us.ibm.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: <stable@kernel.org>
* futex_lock_pi() key refcnt fixMikael Pettersson2010-02-031-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | This fixes a futex key reference count bug in futex_lock_pi(), where a key's reference count is incremented twice but decremented only once, causing the backing object to not be released. If the futex is created in a temporary file in an ext3 file system, this bug causes the file's inode to become an "undead" orphan, which causes an oops from a BUG_ON() in ext3_put_super() when the file system is unmounted. glibc's test suite is known to trigger this, see <http://bugzilla.kernel.org/show_bug.cgi?id=14256>. The bug is a regression from 2.6.28-git3, namely Peter Zijlstra's 38d47c1b7075bd7ec3881141bb3629da58f88dab "[PATCH] futex: rely on get_user_pages() for shared futexes". That commit made get_futex_key() also increment the reference count of the futex key, and updated its callers to decrement the key's reference count before returning. Unfortunately the normal exit path in futex_lock_pi() wasn't corrected: the reference count is incremented by get_futex_key() and queue_lock(), but the normal exit path only decrements once, via unqueue_me_pi(). The fix is to put_futex_key() after unqueue_me_pi(), since 2.6.31 this is easily done by 'goto out_put_key' rather than 'goto out'. Signed-off-by: Mikael Pettersson <mikpe@it.uu.se> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Darren Hart <dvhltc@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: <stable@kernel.org>
* futexes: Remove rw parameter from get_futex_key()KOSAKI Motohiro2010-01-131-15/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, futexes have two problem: A) The current futex code doesn't handle private file mappings properly. get_futex_key() uses PageAnon() to distinguish file and anon, which can cause the following bad scenario: 1) thread-A call futex(private-mapping, FUTEX_WAIT), it sleeps on file mapping object. 2) thread-B writes a variable and it makes it cow. 3) thread-B calls futex(private-mapping, FUTEX_WAKE), it wakes up blocked thread on the anonymous page. (but it's nothing) B) Current futex code doesn't handle zero page properly. Read mode get_user_pages() can return zero page, but current futex code doesn't handle it at all. Then, zero page makes infinite loop internally. The solution is to use write mode get_user_page() always for page lookup. It prevents the lookup of both file page of private mappings and zero page. Performance concerns: Probaly very little, because glibc always initialize variables for futex before to call futex(). It means glibc users never see the overhead of this patch. Compatibility concerns: This patch has few compatibility issues. After this patch, FUTEX_WAIT require writable access to futex variables (read-only mappings makes EFAULT). But practically it's not a problem, glibc always initalizes variables for futexes explicitly - nobody uses read-only mappings. Reported-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Darren Hart <dvhltc@us.ibm.com> Cc: <stable@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Ulrich Drepper <drepper@gmail.com> LKML-Reference: <20100105162633.45A2.A69D9226@jp.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* rtmutes: Convert rtmutex.lock to raw_spinlockThomas Gleixner2009-12-141-3/+3
| | | | | | | | | Convert locks which cannot be sleeping locks in preempt-rt to raw_spinlocks. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Ingo Molnar <mingo@elte.hu>
* sched: Convert pi_lock to raw_spinlockThomas Gleixner2009-12-141-19/+19
| | | | | | | | | Convert locks which cannot be sleeping locks in preempt-rt to raw_spinlocks. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Ingo Molnar <mingo@elte.hu>
* plist: Make plist debugging raw_spinlock awareThomas Gleixner2009-12-141-3/+3
| | | | | | | | | plists are used with spinlocks and raw_spinlocks. Change the plist debugging to handle both types. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Ingo Molnar <mingo@elte.hu>
* futex: Take mmap_sem for get_user_pages in fault_in_user_writeableAndi Kleen2009-12-081-2/+8
| | | | | | | | | | | | | | get_user_pages() must be called with mmap_sem held. Signed-off-by: Andi Kleen <ak@linux.intel.com> Cc: stable@kernel.org Cc: Andrew Morton <akpm@linuxfoundation.org> Cc: Nick Piggin <npiggin@suse.de> Cc: Darren Hart <dvhltc@us.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> LKML-Reference: <20091208121942.GA21298@basil.fritz.box> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* futex: Fix spurious wakeup for requeue_pi reallyThomas Gleixner2009-10-281-5/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | The requeue_pi path doesn't use unqueue_me() (and the racy lock_ptr == NULL test) nor does it use the wake_list of futex_wake() which where the reason for commit 41890f2 (futex: Handle spurious wake up) See debugging discussing on LKML Message-ID: <4AD4080C.20703@us.ibm.com> The changes in this fix to the wait_requeue_pi path were considered to be a likely unecessary, but harmless safety net. But it turns out that due to the fact that for unknown $@#!*( reasons EWOULDBLOCK is defined as EAGAIN we built an endless loop in the code path which returns correctly EWOULDBLOCK. Spurious wakeups in wait_requeue_pi code path are unlikely so we do the easy solution and return EWOULDBLOCK^WEAGAIN to user space and let it deal with the spurious wakeup. Cc: Darren Hart <dvhltc@us.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: John Stultz <johnstul@linux.vnet.ibm.com> Cc: Dinakar Guniguntala <dino@in.ibm.com> LKML-Reference: <4AE23C74.1090502@us.ibm.com> Cc: stable@kernel.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* futex: Move drop_futex_key_refs out of spinlock'ed regionDarren Hart2009-10-161-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | When requeuing tasks from one futex to another, the reference held by the requeued task to the original futex location needs to be dropped eventually. Dropping the reference may ultimately lead to a call to "iput_final" and subsequently call into filesystem- specific code - which may be non-atomic. It is therefore safer to defer this drop operation until after the futex_hash_bucket spinlock has been dropped. Originally-From: Helge Bahmann <hcb@chaoticmind.net> Signed-off-by: Darren Hart <dvhltc@us.ibm.com> Cc: <stable@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Dinakar Guniguntala <dino@in.ibm.com> Cc: John Stultz <johnstul@linux.vnet.ibm.com> Cc: Sven-Thorsten Dietrich <sdietrich@novell.com> Cc: John Kacur <jkacur@redhat.com> LKML-Reference: <4AD7A298.5040802@us.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* futex: Check for NULL keys in match_futexDarren Hart2009-10-141-1/+2
| | | | | | | | | | | | | | | | | | | | If userspace tries to perform a requeue_pi on a non-requeue_pi waiter, it will find the futex_q->requeue_pi_key to be NULL and OOPS. Check for NULL in match_futex() instead of doing explicit NULL pointer checks on all call sites. While match_futex(NULL, NULL) returning false is a little odd, it's still correct as we expect valid key references. Signed-off-by: Darren Hart <dvhltc@us.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@elte.hu> CC: Eric Dumazet <eric.dumazet@gmail.com> CC: Dinakar Guniguntala <dino@in.ibm.com> CC: John Stultz <johnstul@us.ibm.com> Cc: stable@kernel.org LKML-Reference: <4AD60687.10306@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* futex: Handle spurious wake upThomas Gleixner2009-10-131-3/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | The futex code does not handle spurious wake up in futex_wait and futex_wait_requeue_pi. The code assumes that any wake up which was not caused by futex_wake / requeue or by a timeout was caused by a signal wake up and returns one of the syscall restart error codes. In case of a spurious wake up the signal delivery code which deals with the restart error codes is not invoked and we return that error code to user space. That causes applications which actually check the return codes to fail. Blaise reported that on preempt-rt a python test program run into a exception trap. -rt exposed that due to a built in spurious wake up accelerator :) Solve this by checking signal_pending(current) in the wake up path and handle the spurious wake up case w/o returning to user space. Reported-by: Blaise Gassend <blaise@willowgarage.com> Debugged-by: Darren Hart <dvhltc@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: stable@kernel.org LKML-Reference: <new-submission>
* Merge branch 'core-fixes-for-linus' of ↵Linus Torvalds2009-10-081-2/+1
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: futex: fix requeue_pi key imbalance futex: Fix typo in FUTEX_WAIT/WAKE_BITSET_PRIVATE definitions rcu: Place root rcu_node structure in separate lockdep class rcu: Make hot-unplugged CPU relinquish its own RCU callbacks rcu: Move rcu_barrier() to rcutree futex: Move exit_pi_state() call to release_mm() futex: Nullify robust lists after cleanup futex: Fix locking imbalance panic: Fix panic message visibility by calling bust_spinlocks(0) before dying rcu: Replace the rcu_barrier enum with pointer to call_rcu*() function rcu: Clean up code based on review feedback from Josh Triplett, part 4 rcu: Clean up code based on review feedback from Josh Triplett, part 3 rcu: Fix rcu_lock_map build failure on CONFIG_PROVE_LOCKING=y rcu: Clean up code to address Ingo's checkpatch feedback rcu: Clean up code based on review feedback from Josh Triplett, part 2 rcu: Clean up code based on review feedback from Josh Triplett
| * futex: fix requeue_pi key imbalanceDarren Hart2009-10-071-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If futex_wait_requeue_pi() wakes prior to requeue, we drop the reference to the source futex_key twice, once in handle_early_requeue_pi_wakeup() and once on our way out. Remove the drop from the handle_early_requeue_pi_wakeup() and keep the get/drops together in futex_wait_requeue_pi(). Reported-by: Helge Bahmann <hcb@chaoticmind.net> Signed-off-by: Darren Hart <dvhltc@us.ibm.com> Cc: Helge Bahmann <hcb@chaoticmind.net> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Dinakar Guniguntala <dino@in.ibm.com> Cc: John Stultz <johnstul@us.ibm.com> Cc: stable-2.6.31 <stable@kernel.org> LKML-Reference: <4ACCE21E.5030805@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
| * futex: Fix locking imbalanceThomas Gleixner2009-10-051-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Rich reported a lock imbalance in the futex code: http://bugzilla.kernel.org/show_bug.cgi?id=14288 It's caused by the displacement of the retry_private label in futex_wake_op(). The code unlocks the hash bucket locks in the error handling path and retries without locking them again which makes the next unlock fail. Move retry_private so we lock the hash bucket locks when we retry. Reported-by: Rich Ercolany <rercola@acm.jhu.edu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Darren Hart <dvhltc@us.ibm.com> Cc: stable-2.6.31 <stable@kernel.org> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* | futex: Add memory barrier commentary to futex_wait_queue_me()Darren Hart2009-09-241-0/+6
|/ | | | | | | | | | | | | The memory barrier semantics of futex_wait_queue_me() are non-obvious. Add some commentary to try and clarify it. Signed-off-by: Darren Hart <dvhltc@us.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Dinakar Guniguntala <dino@in.ibm.com> Cc: John Stultz <johnstul@us.ibm.com> LKML-Reference: <20090924185447.694.38948.stgit@Aeon> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* futex: Fix wakeup race by setting TASK_INTERRUPTIBLE before queue_me()Darren Hart2009-09-221-12/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | PI futexes do not use the same plist_node_empty() test for wakeup. It was possible for the waiter (in futex_wait_requeue_pi()) to set TASK_INTERRUPTIBLE after the waker assigned the rtmutex to the waiter. The waiter would then note the plist was not empty and call schedule(). The task would not be found by any subsequeuent futex wakeups, resulting in a userspace hang. By moving the setting of TASK_INTERRUPTIBLE to before the call to queue_me(), the race with the waker is eliminated. Since we no longer call get_user() from within queue_me(), there is no need to delay the setting of TASK_INTERRUPTIBLE until after the call to queue_me(). The FUTEX_LOCK_PI operation is not affected as futex_lock_pi() relies entirely on the rtmutex code to handle schedule() and wakeup. The requeue PI code is affected because the waiter starts as a non-PI waiter and is woken on a PI futex. Remove the crusty old comment about holding spinlocks() across get_user() as we no longer do that. Correct the locking statement with a description of why the test is performed. Signed-off-by: Darren Hart <dvhltc@us.ibm.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Dinakar Guniguntala <dino@in.ibm.com> Cc: John Stultz <johnstul@us.ibm.com> LKML-Reference: <20090922053038.8717.97838.stgit@Aeon> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* futex: Correct futex_q woken state commentaryDarren Hart2009-09-221-16/+16
| | | | | | | | | | | | | | | | | | | Use kernel-doc format to describe struct futex_q. Correct the wakeup definition to eliminate the statement about waking the waiter between the plist_del() and the q->lock_ptr = 0. Note in the comment that PI futexes have a different definition of the woken state. Signed-off-by: Darren Hart <dvhltc@us.ibm.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Dinakar Guniguntala <dino@in.ibm.com> Cc: John Stultz <johnstul@us.ibm.com> LKML-Reference: <20090922053029.8717.62798.stgit@Aeon> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* futex: Make function kernel-doc commentary consistentDarren Hart2009-09-221-20/+21
| | | | | | | | | | | | | | | | | Make the existing function kernel-doc consistent throughout futex.c, following Documentation/kernel-doc-nano-howto.txt as closely as possible. When unsure, at least be consistent within futex.c. Signed-off-by: Darren Hart <dvhltc@us.ibm.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Dinakar Guniguntala <dino@in.ibm.com> Cc: John Stultz <johnstul@us.ibm.com> LKML-Reference: <20090922053022.8717.13339.stgit@Aeon> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* futex: Correct queue_me and unqueue_me commentaryDarren Hart2009-09-221-12/+29
| | | | | | | | | | | | | | The queue_me/unqueue_me commentary is oddly placed and out of date. Clean it up and correct the inaccurate bits. Signed-off-by: Darren Hart <dvhltc@us.ibm.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Dinakar Guniguntala <dino@in.ibm.com> Cc: John Stultz <johnstul@us.ibm.com> LKML-Reference: <20090922053015.8717.71713.stgit@Aeon> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* futex: Correct futex_wait_requeue_pi() commentaryDarren Hart2009-09-221-3/+3
| | | | | | | | | | | | | | Correct various typos and formatting inconsistencies in the commentary of futex_wait_requeue_pi(). Signed-off-by: Darren Hart <dvhltc@us.ibm.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Dinakar Guniguntala <dino@in.ibm.com> Cc: John Stultz <johnstul@us.ibm.com> LKML-Reference: <20090922052958.8717.21932.stgit@Aeon> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* Merge branch 'core-futexes-for-linus' of ↵Linus Torvalds2009-09-111-17/+30
|\ | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-futexes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: futex: Detect mismatched requeue targets futex: Correct futex_wait_requeue_pi() commentary
| * futex: Detect mismatched requeue targetsDarren Hart2009-08-161-4/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There is currently no check to ensure that userspace uses the same futex requeue target (uaddr2) in futex_requeue() that the waiter used in futex_wait_requeue_pi(). A mismatch here could very unexpected results as the waiter assumes it either wakes on uaddr1 or uaddr2. We could detect this on wakeup in the waiter, but the cleanup is more intense after the improper requeue has occured. This patch stores the waiter's expected requeue target in a new requeue_pi_key pointer in the futex_q which futex_requeue() checks prior to attempting to do a proxy lock acquistion or a requeue when requeue_pi=1. If they don't match, return -EINVAL from futex_requeue, aborting the requeue of any remaining waiters. Signed-off-by: Darren Hart <dvhltc@us.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: John Kacur <jkacur@redhat.com> Cc: Dinakar Guniguntala <dino@in.ibm.com> Cc: John Stultz <johnstul@us.ibm.com> LKML-Reference: <20090814003650.14634.63916.stgit@Aeon> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
| * futex: Correct futex_wait_requeue_pi() commentaryDarren Hart2009-08-041-13/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | The state machine described in the comments wasn't updated with a follow-on fix. Address that and cleanup the corresponding commentary in the function. Signed-off-by: Darren Hart <dvhltc@us.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> LKML-Reference: <4A737C2A.9090001@us.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* | futex: Fix handling of bad requeue syscall pairingDarren Hart2009-08-101-2/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If futex_requeue(requeue_pi=1) finds a futex_q that was created by a call other the futex_wait_requeue_pi(), the q.rt_waiter may be null. If so, this will result in an oops from the following call graph: futex_requeue() rt_mutex_start_proxy_lock() task_blocks_on_rt_mutex() waiter->task dereference OOPS We currently WARN_ON() if this is detected, clearly this is inadequate. If we detect a mispairing in futex_requeue(), bail out, seding -EINVAL to user-space. V2: Fix parenthesis warnings. Signed-off-by: Darren Hart <dvhltc@us.ibm.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: John Kacur <jkacur@redhat.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Dinakar Guniguntala <dino@in.ibm.com> Cc: John Stultz <johnstul@linux.vnet.ibm.com> LKML-Reference: <4A7CA8C0.7010809@us.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* | futex: Update futex_q lock_ptr on requeue proxy lockDarren Hart2009-08-101-4/+13
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | futex_requeue() can acquire the lock on behalf of a waiter early on or during the requeue loop if it is uncontended or in the event of a lock steal or owner died. On wakeup, the waiter (in futex_wait_requeue_pi()) cleans up the pi_state owner using the lock_ptr to protect against concurrent access to the pi_state. The pi_state is hung off futex_q's on the requeue target futex hash bucket so the lock_ptr needs to be updated accordingly. The problem manifested by triggering the WARN_ON in lookup_pi_state() about the pid != pi_state->owner->pid. With this patch, the pi_state is properly guarded against concurrent access via the requeue target hb lock. The astute reviewer may notice that there is a window of time between when futex_requeue() unlocks the hb locks and when futex_wait_requeue_pi() will acquire hb2->lock. During this time the pi_state and uval are not in sync with the underlying rtmutex owner (but the uval does indicate there are waiters, so no atomic changes will occur in userspace). However, this is not a problem. Should a contending thread enter lookup_pi_state() and acquire hb2->lock before the ownership is fixed up, it will find the pi_state hung off a waiter's (possibly the pending owner's) futex_q and block on the rtmutex. Once futex_wait_requeue_pi() fixes up the owner, it will also move the pi_state from the old owner's task->pi_state_list to its own. v3: Fix plist lock name for application to mainline (rather than -rt) Compile tested against tip/v2.6.31-rc5. Signed-off-by: Darren Hart <dvhltc@us.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Dinakar Guniguntala <dino@in.ibm.com> Cc: John Stultz <johnstul@linux.vnet.ibm.com> LKML-Reference: <4A7F4EFF.6090903@us.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* futexes: Fix infinite loop in get_futex_key() on huge pageSonny Rao2009-07-111-0/+1
| | | | | | | | | | | | | | | | | | | | | | | get_futex_key() can infinitely loop if it is called on a virtual address that is within a huge page but not aligned to the beginning of that page. The call to get_user_pages_fast will return the struct page for a sub-page within the huge page and the check for page->mapping will always fail. The fix is to call compound_head on the page before checking that it's mapped. Signed-off-by: Sonny Rao <sonnyrao@us.ibm.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@kernel.org Cc: anton@samba.org Cc: rajamony@us.ibm.com Cc: speight@us.ibm.com Cc: mstephen@us.ibm.com Cc: grimm@us.ibm.com Cc: mikey@ozlabs.au.ibm.com LKML-Reference: <20090710231313.GA23572@us.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>