summaryrefslogtreecommitdiffstats
path: root/kernel/pid.c
Commit message (Collapse)AuthorAgeFilesLines
* net ip6 flowlabel: Make owner a union of struct pid * and kuid_tEric W. Biederman2012-08-141-0/+1
| | | | | | | | | | | | | | | | | | | | | | Correct a long standing omission and use struct pid in the owner field of struct ip6_flowlabel when the share type is IPV6_FL_S_PROCESS. This guarantees we don't have issues when pid wraparound occurs. Use a kuid_t in the owner field of struct ip6_flowlabel when the share type is IPV6_FL_S_USER to add user namespace support. In /proc/net/ip6_flowlabel capture the current pid namespace when opening the file and release the pid namespace when the file is closed ensuring we print the pid owner value that is meaning to the reader of the file. Similarly use from_kuid_munged to print uid values that are meaningful to the reader of the file. This requires exporting pid_nr_ns so that ipv6 can continue to built as a module. Yoiks what silliness Acked-by: David S. Miller <davem@davemloft.net> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
* mm: add a low limit to alloc_large_system_hashTim Bird2012-05-241-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | UDP stack needs a minimum hash size value for proper operation and also uses alloc_large_system_hash() for proper NUMA distribution of its hash tables and automatic sizing depending on available system memory. On some low memory situations, udp_table_init() must ignore the alloc_large_system_hash() result and reallocs a bigger memory area. As we cannot easily free old hash table, we leak it and kmemleak can issue a warning. This patch adds a low limit parameter to alloc_large_system_hash() to solve this problem. We then specify UDP_HTABLE_SIZE_MIN for UDP/UDPLite hash table allocation. Reported-by: Mark Asselstine <mark.asselstine@windriver.com> Reported-by: Tim Bird <tim.bird@am.sony.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* vfs: fix panic in __d_lookup() with high dentry hashtable countsDimitri Sivanich2012-02-131-2/+2
| | | | | | | | | | | | | | When the number of dentry cache hash table entries gets too high (2147483648 entries), as happens by default on a 16TB system, use of a signed integer in the dcache_init() initialization loop prevents the dentry_hashtable from getting initialized, causing a panic in __d_lookup(). Fix this in dcache_init() and similar areas. Signed-off-by: Dimitri Sivanich <sivanich@sgi.com> Acked-by: David S. Miller <davem@davemloft.net> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* sysctl: add the kernel.ns_last_pid controlPavel Emelyanov2012-01-121-1/+3
| | | | | | | | | | | | | | | | | | | | The sysctl works on the current task's pid namespace, getting and setting its last_pid field. Writing is allowed for CAP_SYS_ADMIN-capable tasks thus making it possible to create a task with desired pid value. This ability is required badly for the checkpoint/restore in userspace. This approach suits all the parties for now. Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Serge Hallyn <serue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* kernel: Map most files to use export.h instead of module.hPaul Gortmaker2011-10-311-1/+1
| | | | | | | | | | | | | | | | The changed files were only including linux/module.h for the EXPORT_SYMBOL infrastructure, and nothing else. Revector them onto the isolated export header for faster compile times. Nothing to see here but a whole lot of instances of: -#include <linux/module.h> +#include <linux/export.h> This commit is only changing the kernel dir; next targets will probably be mm, fs, the arch dirs, etc. Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
* rcu: Restore checks for blocking in RCU read-side critical sectionsPaul E. McKenney2011-09-281-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | Long ago, using TREE_RCU with PREEMPT would result in "scheduling while atomic" diagnostics if you blocked in an RCU read-side critical section. However, PREEMPT now implies TREE_PREEMPT_RCU, which defeats this diagnostic. This commit therefore adds a replacement diagnostic based on PROVE_RCU. Because rcu_lockdep_assert() and lockdep_rcu_dereference() are now being used for things that have nothing to do with rcu_dereference(), rename lockdep_rcu_dereference() to lockdep_rcu_suspicious() and add a third argument that is a string indicating what is suspicious. This third argument is passed in from a new third argument to rcu_lockdep_assert(). Update all calls to rcu_lockdep_assert() to add an informative third argument. Also, add a pair of rcu_lockdep_assert() calls from within rcu_note_context_switch(), one complaining if a context switch occurs in an RCU-bh read-side critical section and another complaining if a context switch occurs in an RCU-sched read-side critical section. These are present only if the PROVE_RCU kernel parameter is enabled. Finally, fix some checkpatch whitespace complaints in lockdep.c. Again, you must enable PROVE_RCU to see these new diagnostics. But you are enabling PROVE_RCU to check out new RCU uses in any case, aren't you? Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
* rcu: treewide: Do not use rcu_read_lock_held when calling rcu_dereference_checkMichal Hocko2011-07-081-1/+0
| | | | | | | | | | Since ca5ecddf (rcu: define __rcu address space modifier for sparse) rcu_dereference_check use rcu_read_lock_held as a part of condition automatically so callers do not have to do that as well. Signed-off-by: Michal Hocko <mhocko@suse.cz> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
* next_pidmap: fix overflow conditionLinus Torvalds2011-04-181-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | next_pidmap() just quietly accepted whatever 'last' pid that was passed in, which is not all that safe when one of the users is /proc. Admittedly the proc code should do some sanity checking on the range (and that will be the next commit), but that doesn't mean that the helper functions should just do that pidmap pointer arithmetic without checking the range of its arguments. So clamp 'last' to PID_MAX_LIMIT. The fact that we then do "last+1" doesn't really matter, the for-loop does check against the end of the pidmap array properly (it's only the actual pointer arithmetic overflow case we need to worry about, and going one bit beyond isn't going to overflow). [ Use PID_MAX_LIMIT rather than pid_max as per Eric Biederman ] Reported-by: Tavis Ormandy <taviso@cmpxchg8b.com> Analyzed-by: Robert Święcki <robert@swiecki.net> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* export pid symbols needed for kvm_vcpu_on_spinRik van Riel2011-03-171-0/+2
| | | | | | | Export the symbols required for a race-free kvm_vcpu_on_spin. Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
* Add RCU check for find_task_by_vpid().Tetsuo Handa2010-08-191-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | find_task_by_vpid() says "Must be called under rcu_read_lock().". But due to commit 3120438 "rcu: Disable lockdep checking in RCU list-traversal primitives", we are currently unable to catch "find_task_by_vpid() with tasklist_lock held but RCU lock not held" errors due to the RCU-lockdep checks being suppressed in the RCU variants of the struct list_head traversals. This commit therefore places an explicit check for being in an RCU read-side critical section in find_task_by_pid_ns(). =================================================== [ INFO: suspicious rcu_dereference_check() usage. ] --------------------------------------------------- kernel/pid.c:386 invoked rcu_dereference_check() without protection! other info that might help us debug this: rcu_scheduler_active = 1, debug_locks = 1 1 lock held by rc.sysinit/1102: #0: (tasklist_lock){.+.+..}, at: [<c1048340>] sys_setpgid+0x40/0x160 stack backtrace: Pid: 1102, comm: rc.sysinit Not tainted 2.6.35-rc3-dirty #1 Call Trace: [<c105e714>] lockdep_rcu_dereference+0x94/0xb0 [<c104b4cd>] find_task_by_pid_ns+0x6d/0x70 [<c104b4e8>] find_task_by_vpid+0x18/0x20 [<c1048347>] sys_setpgid+0x47/0x160 [<c1002b50>] sysenter_do_call+0x12/0x36 Commit updated to use a new rcu_lockdep_assert() exported API rather than the old internal __do_rcu_dereference(). Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
* rculist: avoid __rcu annotationsArnd Bergmann2010-08-191-1/+1
| | | | | | | | | | | | | | | | This avoids warnings from missing __rcu annotations in the rculist implementation, making it possible to use the same lists in both RCU and non-RCU cases. We can add rculist annotations later, together with lockdep support for rculist, which is missing as well, but that may involve changing all the users. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
* pids: alloc_pidmap: remove the unnecessary boundary checksOleg Nesterov2010-08-111-10/+7
| | | | | | | | | | | | | | | | | | | | | | alloc_pidmap() calculates max_scan so that if the initial offset != 0 we inspect the first map->page twice. This is correct, we want to find the unused bits < offset in this bitmap block. Add the comment. But it doesn't make any sense to stop the find_next_offset() loop when we are looking into this map->page for the second time. We have already already checked the bits >= offset during the first attempt, it is fine to do this again, no matter if we succeed this time or not. Remove this hard-to-understand code. It optimizes the very unlikely case when we are going to fail, but slows down the more likely case. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Salman Qazi <sqazi@google.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* pids: fix a race in pid generation that causes pids to be reused immediatelySalman2010-08-111-1/+38
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A program that repeatedly forks and waits is susceptible to having the same pid repeated, especially when it competes with another instance of the same program. This is really bad for bash implementation. Furthermore, many shell scripts assume that pid numbers will not be used for some length of time. Race Description: A B // pid == offset == n // pid == offset == n + 1 test_and_set_bit(offset, map->page) test_and_set_bit(offset, map->page); pid_ns->last_pid = pid; pid_ns->last_pid = pid; // pid == n + 1 is freed (wait()) // Next fork()... last = pid_ns->last_pid; // == n pid = last + 1; Code to reproduce it (Running multiple instances is more effective): #include <errno.h> #include <sys/types.h> #include <sys/wait.h> #include <unistd.h> #include <stdio.h> #include <stdlib.h> // The distance mod 32768 between two pids, where the first pid is expected // to be smaller than the second. int PidDistance(pid_t first, pid_t second) { return (second + 32768 - first) % 32768; } int main(int argc, char* argv[]) { int failed = 0; pid_t last_pid = 0; int i; printf("%d\n", sizeof(pid_t)); for (i = 0; i < 10000000; ++i) { if (i % 32786 == 0) printf("Iter: %d\n", i/32768); int child_exit_code = i % 256; pid_t pid = fork(); if (pid == -1) { fprintf(stderr, "fork failed, iteration %d, errno=%d", i, errno); exit(1); } if (pid == 0) { // Child exit(child_exit_code); } else { // Parent if (i > 0) { int distance = PidDistance(last_pid, pid); if (distance == 0 || distance > 30000) { fprintf(stderr, "Unexpected pid sequence: previous fork: pid=%d, " "current fork: pid=%d for iteration=%d.\n", last_pid, pid, i); failed = 1; } } last_pid = pid; int status; int reaped = wait(&status); if (reaped != pid) { fprintf(stderr, "Wait return value: expected pid=%d, " "got %d, iteration %d\n", pid, reaped, i); failed = 1; } else if (WEXITSTATUS(status) != child_exit_code) { fprintf(stderr, "Unexpected exit status %x, iteration %d\n", WEXITSTATUS(status), i); failed = 1; } } } exit(failed); } Thanks to Ted Tso for the key ideas of this implementation. Signed-off-by: Salman Qazi <sqazi@google.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* pids: increase pid_max based on num_possible_cpusHedi Berriche2010-05-271-0/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | On a system with a substantial number of processors, the early default pid_max of 32k will not be enough. A system with 1664 CPU's, there are 25163 processes started before the login prompt. It's estimated that with 2048 CPU's we will pass the 32k limit. With 4096, we'll reach that limit very early during the boot cycle, and processes would stall waiting for an available pid. This patch increases the early maximum number of pids available, and increases the minimum number of pids that can be set during runtime. [akpm@linux-foundation.org: fix warnings] Signed-off-by: Hedi Berriche <hedi@sgi.com> Signed-off-by: Mike Travis <travis@sgi.com> Signed-off-by: Robin Holt <holt@sgi.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Pavel Machek <pavel@ucw.cz> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: Greg KH <gregkh@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: John Stoffel <john@stoffel.org> Cc: Jack Steiner <steiner@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge branch 'core-fixes-for-linus' of ↵Linus Torvalds2010-03-131-1/+3
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: locking: Make sparse work with inline spinlocks and rwlocks x86/mce: Fix RCU lockdep splats rcu: Increase RCU CPU stall timeouts if PROVE_RCU ftrace: Replace read_barrier_depends() with rcu_dereference_raw() rcu: Suppress RCU lockdep warnings during early boot rcu, ftrace: Fix RCU lockdep splat in ftrace_perf_buf_prepare() rcu: Suppress __mpol_dup() false positive from RCU lockdep rcu: Make rcu_read_lock_sched_held() handle !PREEMPT rcu: Add control variables to lockdep_rcu_dereference() diagnostics rcu, cgroup: Relax the check in task_subsys_state() as early boot is now handled by lockdep-RCU rcu: Use wrapper function instead of exporting tasklist_lock sched, rcu: Fix rcu_dereference() for RCU-lockdep rcu: Make task_subsys_state() RCU-lockdep checks handle boot-time use rcu: Fix holdoff for accelerated GPs for last non-dynticked CPU x86/gart: Unexport gart_iommu_aperture Fix trivial conflicts in kernel/trace/ftrace.c
| * rcu: Use wrapper function instead of exporting tasklist_lockPaul E. McKenney2010-03-041-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Lockdep-RCU commit d11c563d exported tasklist_lock, which is not a good thing. This patch instead exports a function that uses lockdep to check whether tasklist_lock is held. Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com Cc: Christoph Hellwig <hch@lst.de> LKML-Reference: <1267631219-8713-1-git-send-email-paulmck@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* | kernel/pid.c: update comment on find_task_by_pid_nsTetsuo Handa2010-03-061-1/+1
|/ | | | | | | | | | | | | | | | | tasklist_lock does protect the task and its pid, it can't go away. The problem is that find_pid_ns() itself is unsafe without rcu lock, it can race with copy_process()->free_pid(any_pid). Protecting copy_process()->free_pid(any_pid) with tasklist_lock would make it possible to call find_task_by_pid_ns() under tasklist safely, but we don't do so because we are trying to get rid of the read_lock sites of tasklist_lock. Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Oleg Nesterov <oleg@redhat.com> Cc: "Paul E. McKenney" <paulmck@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* sched: Use lockdep-based checking on rcu_dereference()Paul E. McKenney2010-02-251-1/+1
| | | | | | | | | | | | | | | | | | | | Update the rcu_dereference() usages to take advantage of the new lockdep-based checking. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <1266887105-1528-6-git-send-email-paulmck@linux.vnet.ibm.com> [ -v2: fix allmodconfig missing symbol export build failure on x86 ] Signed-off-by: Ingo Molnar <mingo@elte.hu>
* pid: reduce code size by using a pointer to iterate over arrayAndré Goddard Rosa2009-12-161-3/+2
| | | | | | | | | | | | | It decreases code size by 16 bytes on my gcc 4.4.1 on Core 2: text data bss dec hex filename 4314 2216 8 6538 198a kernel/pid.o-BEFORE 4298 2216 8 6522 197a kernel/pid.o-AFTER Signed-off-by: André Goddard Rosa <andre.goddard@gmail.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* pid: tighten pidmap spinlock critical section by removing kfree()André Goddard Rosa2009-12-161-3/+4
| | | | | | | | | | | | | Avoid calling kfree() under pidmap spinlock, calling it afterwards. Normally kfree() is fast, but sometimes it can be slow, so avoid calling it under the spinlock if we can do it. Signed-off-by: André Goddard Rosa <andre.goddard@gmail.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: also use alloc_large_system_hash() for the PID hash tableJan Beulich2009-09-221-11/+4
| | | | | | | | | | | | | | | | This is being done by allowing boot time allocations to specify that they may want a sub-page sized amount of memory. Overall this seems more consistent with the other hash table allocations, and allows making two supposedly mm-only variables really mm-only (nr_{kernel,all}_pages). Signed-off-by: Jan Beulich <jbeulich@novell.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* kmemleak: Remove alloc_bootmem annotations introduced in the pastCatalin Marinas2009-07-091-7/+0
| | | | | | | | | | kmemleak_alloc() calls were added in some places where alloc_bootmem was called. Since now kmemleak tracks bootmem allocations, these explicit calls should be run. Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Ingo Molnar <mingo@elte.hu> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
* kmemleak: Inform kmemleak about pid_hashCatalin Marinas2009-06-291-0/+7
| | | | | | | | Kmemleak does not track alloc_bootmem calls but the pid_hash allocated in pidhash_init() would need to be scanned as it contains pointers to struct pid objects. Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
* pids: clean up find_task_by_pid variantsChristoph Hellwig2009-06-181-14/+3
| | | | | | | | | | | | | | | | | | | | | | | find_task_by_pid_type_ns is only used to implement find_task_by_vpid and find_task_by_pid_ns, but both of them pass PIDTYPE_PID as first argument. So just fold find_task_by_pid_type_ns into find_task_by_pid_ns and use find_task_by_pid_ns to implement find_task_by_vpid. While we're at it also remove the exports for find_task_by_pid_ns and find_task_by_vpid - we don't have any modular callers left as the only modular caller of he old pre pid namespace find_task_by_pid (gfs2) was switched to pid_task which operates on a struct pid pointer instead of a pid_t. Given the confusion about pid_t values vs namespace that's generally the better option anyway and I think we're better of restricting modules to do it that way. Signed-off-by: Christoph Hellwig <hch@lst.de> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* pids: refactor vnr/nr_ns helpers to make them safeOleg Nesterov2009-04-021-15/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Inho, the safety rules for vnr/nr_ns helpers are horrible and buggy. task_pid_nr_ns(task) needs rcu/tasklist depending on task == current. As for "special" pids, vnr/nr_ns helpers always need rcu. However, if task != current, they are unsafe even under rcu lock, we can't trust task->group_leader without the special checks. And almost every helper has a callsite which needs a fix. Also, it is a bit annoying that the implementations of, say, task_pgrp_vnr() and task_pgrp_nr_ns() are not "symmetrical". This patch introduces the new helper, __task_pid_nr_ns(), which is always safe to use, and turns all other helpers into the trivial wrappers. After this I'll send another patch which converts task_tgid_xxx() as well, they're are a bit special. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Louis Rilling <Louis.Rilling@kerlabs.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* pids: improve get_task_pid() to fix the unsafe sys_wait4()->task_pgrp()Oleg Nesterov2009-04-021-0/+2
| | | | | | | | | | | | | | | sys_wait4() does get_pid(task_pgrp(current)), this is not safe. We can add rcu lock/unlock around, but we already have get_task_pid() which can be improved to handle the special pids in more reliable manner. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Louis Rilling <Louis.Rilling@kerlabs.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* pid: generalize task_active_pid_nsEric W. Biederman2009-01-081-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | Currently task_active_pid_ns is not safe to call after a task becomes a zombie and exit_task_namespaces is called, as nsproxy becomes NULL. By reading the pid namespace from the pid of the task we can trivially solve this problem at the cost of one extra memory read in what should be the same cacheline as we read the namespace from. When moving things around I have made task_active_pid_ns out of line because keeping it in pid_namespace.h would require adding includes of pid.h and sched.h that I don't think we want. This change does make task_active_pid_ns unsafe to call during copy_process until we attach a pid on the task_struct which seems to be a reasonable trade off. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Roland McGrath <roland@redhat.com> Cc: Bastian Blank <bastian@waldi.eu.org> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Nadia Derbey <Nadia.Derbey@bull.net> Acked-by: Serge Hallyn <serue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* trivial: fix then -> than typos in comments and documentationFrederik Schwarzer2009-01-061-1/+1
| | | | | | | - (better, more, bigger ...) then -> (...) than Signed-off-by: Frederik Schwarzer <schwarzerf@gmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
* pidns: remove now unused find_pid function.Pavel Emelyanov2008-07-251-7/+1
| | | | | | | | | | | | | This one had the only users so far - the kill_proc, which is removed, so drop this (invalid in namespaced world) call too. And of course - erase all references on it from comments. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* proc: misplaced export of find_get_pidDavid Sterba2008-07-251-1/+1
| | | | | | | | Move EXPORT_SYMBOL right after the func Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* rcu: split list.h and move rcu-protected lists into rculist.hFranck Bui-Huu2008-05-191-0/+1
| | | | | | | | | | | | | | | | | | | | | | Move rcu-protected lists from list.h into a new header file rculist.h. This is done because list are a very used primitive structure all over the kernel and it's currently impossible to include other header files in this list.h without creating some circular dependencies. For example, list.h implements rcu-protected list and uses rcu_dereference() without including rcupdate.h. It actually compiles because users of rcu_dereference() are macros. Others RCU functions could be used too but aren't probably because of this. Therefore this patch creates rculist.h which includes rcupdates without to many changes/troubles. Signed-off-by: Franck Bui-Huu <fbuihuu@gmail.com> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Acked-by: Josh Triplett <josh@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* pids: introduce change_pid() helperOleg Nesterov2008-04-301-5/+16
| | | | | | | | | | | | | | | | | | | | | Based on Eric W. Biederman's idea. Without tasklist_lock held task_session()/task_pgrp() can return NULL if the caller races with setprgp()/setsid() which does detach_pid() + attach_pid(). This can happen even if task == current. Intoduce the new helper, change_pid(), which should be used instead. This way the caller always sees the special pid != NULL, either old or new. Also change the prototype of attach_pid(), it always returns 0 and nobody check the returned value. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* pids: de_thread: don't clear session/pgrp pids for the old leaderOleg Nesterov2008-04-301-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Based on Eric W. Biederman's idea. Unless task == current, without tasklist_lock held task_session()/task_pgrp() can return NULL if the caller races with de_thread() which switches the group leader. Change transfer_pid() to not clear old->pids[type].pid for the old leader. This means that its .pid can point to "nowhere", but this is already true for sub-threads, and the old leader is not group_leader() any longer. IOW, with or without this change we can't trust task's special pids unless it is the group leader. With this change the following code rcu_read_lock(); task = find_task_by_xxx(); do_something(task_pgrp(task), task_session(task)); rcu_read_unlock(); can't race with exec and hit the NULL pid. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Deprecate find_task_by_pid()Pavel Emelyanov2008-04-301-6/+0
| | | | | | | | | | | | | | | | | | | There are some places that are known to operate on tasks' global pids only: * the rest_init() call (called on boot) * the kgdb's getthread * the create_kthread() (since the kthread is run in init ns) So use the find_task_by_pid_ns(..., &init_pid_ns) there and schedule the find_task_by_pid for removal. [sukadev@us.ibm.com: Fix warning in kernel/pid.c] Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* free_pidmap: turn it into free_pidmap(struct upid *)Oleg Nesterov2008-04-301-6/+7
| | | | | | | | | | | | | | | The callers of free_pidmap() pass 2 members of "struct upid", we can just pass "struct upid *" instead. Shaves off 10 bytes from pid.o. Also, simplify the alloc_pid's "out_free:" error path a little bit. This way it looks more clear which subset of pid->numbers[] we are freeing. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc :Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* kernel: remove fastcall in kernel/*Harvey Harrison2008-02-081-9/+9
| | | | | | | | [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* pid: Extend/Fix pid_vnrEric W. Biederman2008-02-081-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | pid_vnr returns the user space pid with respect to the pid namespace the struct pid was allocated in. What we want before we return a pid to user space is the user space pid with respect to the pid namespace of current. pid_vnr is a very nice optimization but because it isn't quite what we want it is easy to use pid_vnr at times when we aren't certain the struct pid was allocated in our pid namespace. Currently this describes at least tiocgpgrp and tiocgsid in ttyio.c the parent process reported in the core dumps and the parent process in get_signal_to_deliver. So unless the performance impact is huge having an interface that does what we want instead of always what we want should be much more reliable and much less error prone. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Acked-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* namespaces: cleanup the code managed with PID_NS optionPavel Emelyanov2008-02-081-181/+3
| | | | | | | | | | | | | | | | | Just like with the user namespaces, move the namespace management code into the separate .c file and mark the (already existing) PID_NS option as "depend on NAMESPACES" [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Acked-by: Serge Hallyn <serue@us.ibm.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: Kirill Korotaev <dev@sw.ru> Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* gfs2: make gfs2_glock.gl_owner_pid be a struct pid *Pavel Emelyanov2008-02-071-0/+1
| | | | | | | | | | | | | | The gl_owner_pid field is used to get the lock owning task by its pid, so make it in a proper manner, i.e. by using the struct pid pointer and pid_task() function. The pid_task() becomes exported for the gfs2 module. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Steven Whitehouse <swhiteho@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* pidns: Place under CONFIG_EXPERIMENTALEric W. Biederman2007-11-141-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is my trivial patch to swat innumerable little bugs with a single blow. After some intensive review (my apologies for not having gotten to this sooner) what we have looks like a good base to build on with the current pid namespace code but it is not complete, and it is still much to simple to find issues where the kernel does the wrong thing outside of the initial pid namespace. Until the dust settles and we are certain we have the ABI and the implementation is as correct as humanly possible let's keep process ID namespaces behind CONFIG_EXPERIMENTAL. Allowing us the option of fixing any ABI or other bugs we find as long as they are minor. Allowing users of the kernel to avoid those bugs simply by ensuring their kernel does not have support for multiple pid namespaces. [akpm@linux-foundation.org: coding-style cleanups] Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Cc: Adrian Bunk <bunk@kernel.org> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Cc: Kir Kolyshkin <kir@swsoft.com> Cc: Kirill Korotaev <dev@sw.ru> Cc: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Uninline the task_xid_nr_ns() callsPavel Emelyanov2007-10-191-0/+24
| | | | | | | | | | | | | | | | | | Since these are expanded into call to pid_nr_ns() anyway, it's OK to move the whole routine out-of-line. This is a cheap way to save ~100 bytes from vmlinux. Together with the previous two patches, it saves half-a-kilo from the vmlinux. Un-inline other (currently inlined) functions must be done with additional performance testing. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Paul Menage <menage@google.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Uninline find_pid etc set of functionsPavel Emelyanov2007-10-191-0/+12
| | | | | | | | | | | | | | | | | | | | | The find_pid/_vpid/_pid_ns functions are used to find the struct pid by its id, depending on whic id - global or virtual - is used. The find_vpid() is a macro that pushes the current->nsproxy->pid_ns on the stack to call another function - find_pid_ns(). It turned out, that this dereference together with the push itself cause the kernel text size to grow too much. Move all these out-of-line. Together with the previous patch this saves a bit less that 400 bytes from .text section. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Paul Menage <menage@google.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* pid namespaces: remove the struct pid unneeded fieldsPavel Emelyanov2007-10-191-1/+0
| | | | | | | | | | | | | Since we've switched from using pid->nr to pid->upids->nr some fields on struct pid are no longer needed Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com> Cc: Paul Menage <menage@google.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Uninline find_task_by_xxx set of functionsPavel Emelyanov2007-10-191-0/+19
| | | | | | | | | | | | | | | | | | | | | | | | The find_task_by_something is a set of macros are used to find task by pid depending on what kind of pid is proposed - global or virtual one. All of them are wrappers above the most generic one - find_task_by_pid_type_ns() - and just substitute some args for it. It turned out, that dereferencing the current->nsproxy->pid_ns construction and pushing one more argument on the stack inline cause kernel text size to grow. This patch moves all this stuff out-of-line into kernel/pid.c. Together with the next patch it saves a bit less than 400 bytes from the .text section. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Paul Menage <menage@google.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* pid namespaces: destroy pid namespace on init's deathSukadev Bhattiprolu2007-10-191-0/+38
| | | | | | | | | | | | | | | Terminate all processes in a namespace when the reaper of the namespace is exiting. We do this by walking the pidmap of the namespace and sending SIGKILL to all processes. Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com> Acked-by: Pavel Emelyanov <xemul@openvz.org> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com> Cc: Paul Menage <menage@google.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* pid namespaces: create a slab-cache for 'struct pid_namespace'Sukadev Bhattiprolu2007-10-191-3/+6
| | | | | | | | | | | | This will help fixing memory leaks due to bad reference counting. Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com> Cc: Paul Menage <menage@google.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* pid namespaces: allow cloning of new namespacePavel Emelyanov2007-10-191-6/+82
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When clone() is invoked with CLONE_NEWPID, create a new pid namespace and then create a new struct pid for the new process. Allocate pid_t's for the new process in the new pid namespace and all ancestor pid namespaces. Make the newly cloned process the session and process group leader. Since the active pid namespace is special and expected to be the first entry in pid->upid_list, preserve the order of pid namespaces. The size of 'struct pid' is dependent on the the number of pid namespaces the process exists in, so we use multiple pid-caches'. Only one pid cache is created during system startup and this used by processes that exist only in init_pid_ns. When a process clones its pid namespace, we create additional pid caches as necessary and use the pid cache to allocate 'struct pids' for that depth. Note, that with this patch the newly created namespace won't work, since the rest of the kernel still uses global pids, but this is to be fixed soon. Init pid namespace still works. [oleg@tv-sign.ru: merge fix] Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com> Cc: Paul Menage <menage@google.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* pid namespaces: miscellaneous preparations for pid namespacesPavel Emelyanov2007-10-191-7/+14
| | | | | | | | | | | | | | | * remove pid.h from pid_namespaces.h; * rework is_(cgroup|global)_init; * optimize (get|put)_pid_ns for init_pid_ns; * declare task_child_reaper to return actual reaper. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com> Cc: Paul Menage <menage@google.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* pid namespaces: helpers to find the task by its numerical idsPavel Emelyanov2007-10-191-17/+26
| | | | | | | | | | | | | | | | When searching the task by numerical id on may need to find it using global pid (as it is done now in kernel) or by its virtual id, e.g. when sending a signal to a task from one namespace the sender will specify the task's virtual id and we should find the task by this value. [akpm@linux-foundation.org: fix gfs2 linkage] Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com> Cc: Paul Menage <menage@google.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* pid namespaces: helpers to obtain pid numbersPavel Emelyanov2007-10-191-0/+13
| | | | | | | | | | | | | | | | | When showing pid to user or getting the pid numerical id for in-kernel use the value of this id may differ depending on the namespace. This set of helpers is used to get the global pid nr, the virtual (i.e. seen by task in its namespace) nr and the nr as it is seen from the specified namespace. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com> Cc: Paul Menage <menage@google.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>